I hypothesize that all country music sounds the same because of similarities in the components of its songs. Specifically, because many country songs are built from vocals and guitar, the way the guitar is played and the kinds of lyrics used may be key in determining whether country music really does sound alike. The way the guitar is played can be characterized by the chord progression, and I hypothesize that many country songs use the same chord progression, whereas other genres show greater variation in the chord progressions used. I also suspect that the lyrics share much of their vocabulary, and that a naive-Bayes analysis may be suitable.

The complete code can be found on GitHub here.


A chord progression is the set of chords used in a song together with the order in which they are played. For example, Pearl Jam's 'Elderly woman...' uses the chord progression D C G G for much of the verse and chorus. Both the chords used and their order are needed. The assumption is that the guitar carries the characteristic 'sound' of the song, and that this sound can be inferred from the guitar's chord progression.

A Python web scraper mines guitar tablature from the internet, specifically ultimate-guitar.com, and reads it to obtain the chords used in songs. The Python code retrieves the chords that users of the site have entered. Because the tablature is user submitted, only 5-star rated tabs in standard tuning are used, for greatest accuracy. A 5-star rating is achieved when all users who rate the tab give it 5 stars. By mining the tablature for the chord progressions played in songs, the end goal is to analyze different genres of music and show that country music has less variation in its chords and chord progressions than other genres.

Guitar tablature Example

Below is an example of typical guitar tablature from ultimate-guitar.com

Band- Metallica
Song- TuesdAYS gONE

  A5                   E5                   F#5                   D
      x 0 2 2 x x          0 2 2 x x x          2 4 4 x x x          x x 0 2 3 2

         Dsus4                Dsus2                  G                   F#m
      x x 0 2 3 3          x x 0 2 3 0          3 2 0 0 3 3          2 4 4 2 2 2

      3 5 5 4 3 3

Gtr I (Eb Ab Db Gb Bb Eb) - 'acoustic'
Gtr II (Eb Ab Db Gb Bb Eb) - 'dobro'
Gtr III (Eb Ab Db Gb Bb Eb) - 'acoustic'
Gtr IV (Eb Ab Db Gb Bb Eb) - 'acoustic'

  Slowly H.=50
                   A5                                 E5
  Gtr I
                    |           |                      |    ||
                    /           /                      /    //

  Gtr II
| Gtr III
|                                                      ~~
| Gtr IV

                    F#5                                   D
  |        |  |      |         ||    |        |     |     |  | || |
  /        /  /      /         //    /        /     /     /  / // /

                        ~~~~~~~                           ~
|                    ~~                                   ~
|                    PM---------------------------|
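Chord names such as A5, F#5, Dsus4 and F#m in a tab like the one above follow a regular pattern, so they can be picked out of plain text with a regular expression. The sketch below is only an illustration, written for Python 3: the scraper in this post reads the chords from the page's markup instead, and the pattern `CHORD_RE` and the helper `extract_chords` are hypothetical names that cover common chord spellings, not every possible one.

```python
import re

# Hypothetical pattern: root note, optional accidental, optional quality,
# optional number, optional slash bass note (e.g. G, F#m, Dsus4, A5, C/G).
CHORD_RE = re.compile(
    r"^[A-G][#b]?(?:maj|min|m|sus|dim|aug|add)?\d*(?:/[A-G][#b]?)?$"
)

def extract_chords(line):
    """Return the whitespace-separated tokens on a line that look like chord names."""
    return [tok for tok in line.split() if CHORD_RE.match(tok)]

header = "  A5                   E5                   F#5                   D"
print(extract_chords(header))          # chord-name line
print(extract_chords("x 0 2 2 x x"))   # fingering line, no chord names
```

A single capital letter such as "A" or "E" in ordinary text would also match, so a filter like this is only reasonable on lines already known to hold chord names.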


The Python web scraper goes through all the tabs of a particular genre and grabs all the chords played in each song. The chord progression is obtained by simply finding the most common 4-chord combination. This chord progression is then added to an SQL database, along with the artist name, song title, genre and subgenre. An example of a subgenre is bluegrass, which is a subgenre of country. This relationship was taken from how ultimate-guitar.com organizes its genres. Once the database of all the chord progressions is obtained, a subset can be queried, e.g. according to subgenre, then sorted and plotted. A cumulative distribution is also computed from the sorted list and is useful for determining global variation.
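As a small, self-contained illustration of the "most common 4-chord combination" step, the sketch below slides a window of four chords over a song and counts each window. It is written for Python 3 and is not the scraper's own code: the function name `most_common_progression` and the example chord list are made up, and windows are kept as tuples rather than joined strings so the individual chords stay separate.

```python
from collections import Counter

def most_common_progression(chords, prog_len=4):
    """Return the most frequent run of prog_len consecutive chords, or () if none."""
    windows = [tuple(chords[i:i + prog_len])
               for i in range(len(chords) - prog_len + 1)]
    if not windows:  # fewer than prog_len chords in the song
        return ()
    return Counter(windows).most_common(1)[0][0]

# Made-up song: D C G G repeated three times, then a short turnaround
song = ["D", "C", "G", "G", "D", "C", "G", "G", "D", "C", "G", "G", "Em", "C"]
print(most_common_progression(song))
```

Here the window D C G G occurs three times and every other window at most twice, so it is the one reported.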

Python script - Chords Webscraper

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 20 12:12:07 2016
Updates on Sun Apr 17
Added SQL database on July 10th

@author: matt
"""
# = Import packages
import requests
from lxml import html
from lxml.etree import tostring
import sqlite3
import collections
import re

# = Helper Functions =========================================================

def removeTags(string):
    """Function to remove html tags"""
    return re.sub('<[^<]+?>', '', string)

def getArtist(webTree):
    """Function to get the artist name from html"""
    ArtistLoc = webTree.find_class('t_autor')
    ArtistStrList = [tostring(track, with_tail=False) for track in list(ArtistLoc[0].iter('a'))]
    Artist = removeTags(ArtistStrList[0])
    return Artist

def getTrack(webTree):
    """Function to get the track name from html"""
    title = webTree.find_class('t_title')
    Tracklist = [tostring(track, with_tail = False) for track in list(title[0].iter('h1'))]
    Track = removeTags(Tracklist[0])
    return Track

def getChordsList(webTree):
    """Function to get a list of chords from html"""
    tabContentClass = webTree.find_class('js-tab-content')
    try:
        chordsList = list(tabContentClass[0].iter('span')) # starts at element 0
        chords = [tostring(chord, with_tail = False).strip('') for chord in chordsList]
    except IndexError: # no tab content found on the page
        chords = []
    return chords

def getChordsArtistTrack(url):
    """Function to get chords from a webpage in str format"""
    webPage = requests.get(url)
    webTree = html.fromstring(webPage.content)
    Artist = getArtist(webTree)
    Track = getTrack(webTree)
    Chords = getChordsList(webTree)
    return Artist, Track, Chords

def getGenreTree(genrekey, page):
    """Function to get xml tree given the genre"""
    if type(page) == int:
        page = str(page)
    theURL = 'https://www.ultimate-guitar.com/search.php?type[2]=300&type2[0]=40000&rating[4]=5' \
        + '&tuning[standard]=standard&genres[0]=' + str(genrekey)+ '&page=' + page \
        + '&view_state=advanced&tab_type_group=text&app_name=ugt&order=myweight'

    pageBand = requests.get(theURL)
    return html.fromstring(pageBand.content)

def getChordProg(chordList, progLen = 4):
    """Function to get the most common chord progression in a list"""
    # +1 so that the final window of progLen chords is also counted
    progs4 = [''.join(chordList[i:i+progLen]) for i in range(len(chordList)-progLen+1)]
    try:
        chordProg = collections.Counter(progs4).most_common(1)[0][0]
    except IndexError: # If there are no chords
        chordProg = ''
    return chordProg

def getGenreDict():
    """Function to get all the genres and output to a dict of dicts where the keys
    to the outer dict are the main genres and the values are dicts, with the inner
    values being the sub genres"""
    url = 'https://www.ultimate-guitar.com/advanced_search.html'
    pageUrl = requests.get(url)
    webTree = html.fromstring(pageUrl.content)
    Main = webTree.find_class('b')
    MainGenreList = list(Main[2].iter('optgroup'))
    GenreDict = {}
    for Genre in MainGenreList:
        GenreStr = tostring(Genre, with_tail = False)
        g1 = str.split(re.sub('
        MainGenre = g1[0]
        GenreInd_complete = re.findall(r'\d+',GenreStr)
        if len(GenreInd_complete)>4:
            GenreInd = GenreInd_complete[2:-2:2]
        else:
            GenreInd = GenreInd_complete[2]
        subGenreDict = dict(zip(g1[1:], GenreInd[1:]))
        if bool(subGenreDict):
            GenreDict[MainGenre] = subGenreDict
        else:
            GenreDict[MainGenre] = dict([(MainGenre, GenreInd[0])])
    return GenreDict
# ============================================================================

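def main():
    # = Set up SQL
    conn = sqlite3.connect('chordProgdb2.sqlite')
    cur = conn.cursor()
    # Uncomment the section below if you want to reset the tables
    # cur.executescript('''
    # DROP TABLE IF EXISTS Artist;
    # DROP TABLE IF EXISTS ChordProg;
    # DROP TABLE IF EXISTS Genre;
    # DROP TABLE IF EXISTS Song;
    #
    # CREATE TABLE Artist (
    #     id      INTEGER PRIMARY KEY,
    #     name    TEXT UNIQUE
    # );
    # CREATE TABLE ChordProg (
    #     id     INTEGER PRIMARY KEY,
    #     prog   TEXT UNIQUE
    # );
    # CREATE TABLE Genre (
    #     id      INTEGER PRIMARY KEY,
    #     name    TEXT UNIQUE
    # );
    # CREATE TABLE Song (
    #     id    INTEGER PRIMARY KEY,
    #     title TEXT,
    #     artist_id INTEGER,
    #     chordProg_id  INTEGER,
    #     genre_id INTEGER,
    #     subgenre_id INTEGER
    # );
    # ''')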
    # ============================================================================

    allGenresDict = getGenreDict()
    #to find all genres: allGenresDict.keys()

    mainGenre = 'Alternative'
    genresDict = allGenresDict[mainGenre]

    cur.execute('''INSERT OR IGNORE INTO Genre (name)
        VALUES ( ? )''', (mainGenre, ) )
    cur.execute('SELECT id FROM Genre WHERE name = ? ', (mainGenre, ))
    main_genre_id = cur.fetchone()[0]

    page = '1' # start at page 1, in str format as will be added to html str

    for genre, genrekey in genresDict.iteritems():
        # = Get xml tree for first page
        tree1 = getGenreTree(genrekey, page)
        pages = tree1.find_class('paging')
        try:
            maxPage = len(list(pages[0].iter('a'))) # see what the max number of pages is
            print('Max Page: '+ str(maxPage))
        except IndexError: # no paging element, so only one page of results
            maxPage = 1
        # = Grab song links on the first page
        songs = tree1.find_class('song result-link')
        songLinks = []
        for i in songs:
            songLinks.append(i.get('href'))

        # = Iterate through the remaining pages and add song links
        for i in range(maxPage -1):
            looppage = i + 2
            looptree = getGenreTree(genrekey, str(looppage))
            loopsongs = looptree.find_class('song result-link')
            for song in loopsongs:
                songLinks.append(song.get('href'))
        print('No of tabs: ' + str(len(songLinks)))

        cur.execute('''INSERT OR IGNORE INTO Genre (name)
            VALUES ( ? )''', (genre, ) )
        cur.execute('SELECT id FROM Genre WHERE name = ? ', (genre, ))
        subgenre_id = cur.fetchone()[0]

        progLen = 4
        for i in songLinks:
            Artist, Song, Chords = getChordsArtistTrack(i)
            chordProg = getChordProg(Chords, progLen)

            cur.execute('''INSERT OR IGNORE INTO Artist (name)
                VALUES ( ? )''', ( Artist, ) )
            cur.execute('SELECT id FROM Artist WHERE name = ? ', (Artist, ))
            artist_id = cur.fetchone()[0]

            cur.execute('''INSERT OR IGNORE INTO ChordProg (prog)
                VALUES ( ? )''', (chordProg, ) )
            cur.execute('SELECT id FROM ChordProg WHERE prog = ? ', (chordProg, ))
            chordProg_id = cur.fetchone()[0]

            cur.execute('''INSERT OR REPLACE INTO Song
                (title, artist_id, genre_id, subgenre_id, chordProg_id)
                VALUES ( ?, ?, ?, ?, ? )''',
                ( Song, artist_id, main_genre_id, subgenre_id, chordProg_id, ) )

        conn.commit() # save the inserts for this subgenre to disk

if __name__=='__main__':
    main()

Database Entity Relationship Diagram

SQL Query Example

Accessing the top 10 most frequent Americana chord progressions


  • Join the appropriate tables in the database.

  • Filter by subgenre.

  • Group by chord progression.

  • Order by their total count, descending.

  • Select appropriate columns.

# -*- coding: utf-8 -*-
"""
Spyder Editor

Created on July 10th
"""
import sqlite3
conn = sqlite3.connect('chordProgdb2.sqlite')
cur = conn.cursor()

def main():
  cur.execute('''
  SELECT chord.prog, COUNT(*)
  FROM Song AS master
  LEFT JOIN Genre AS subgenre_tbl ON subgenre_tbl.id = master.subgenre_id
  LEFT JOIN Genre ON genre.id = master.genre_id
  LEFT JOIN ChordProg AS chord ON chord.id = master.chordProg_id
  WHERE subgenre_tbl.name = 'Americana'
  GROUP BY chord.prog
  ORDER BY COUNT(*) DESC
  LIMIT 10
  ''')

  top10Americana= cur.fetchall()
  print(top10Americana)

if __name__=="__main__":
  main()

out: [(u'DGDG', 7), (u'CGAmC', 6), (u'GCGC', 6), (u'GEmGEm', 6), (u'B7EAE', 5), (u'CFCF', 5),
(u'EEEE', 5), (u'DADA', 4), (u'DEmCG', 4), (u'DGCG', 4)]

Though the data is not very big, it is fair to ask why it is in an SQL database in the first place, since the total is only a few megabytes and will easily fit in memory. My reason is that loading, saving, and accessing SQL tables is a good habit I have developed. Moreover, adding to the database is easy and is standardized by the entity relationship diagram (ERD), data can be added from other sources, and the database can be incorporated into other projects easily.
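To make the lookup-table layout concrete, here is a minimal sketch that builds a cut-down version of the schema in memory, inserts a few rows, and runs the same kind of "top progressions for a subgenre" query. It is written for Python 3 and makes several assumptions: the `id INTEGER PRIMARY KEY` columns, the helper names `get_id` and `top_progressions`, and the four example songs are all invented for illustration, and the Artist table and main-genre column are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Genre     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE ChordProg (id INTEGER PRIMARY KEY, prog TEXT UNIQUE);
CREATE TABLE Song      (id INTEGER PRIMARY KEY, title TEXT,
                        chordProg_id INTEGER, subgenre_id INTEGER);
""")

def get_id(table, column, value):
    """Insert value into a lookup table if it is new, then return its row id."""
    # table and column are the fixed names used below, never user input
    cur.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    cur.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,))
    return cur.fetchone()[0]

# Made-up rows: (title, chord progression, subgenre)
rows = [("Song A", "DGDG", "Americana"), ("Song B", "DGDG", "Americana"),
        ("Song C", "GCGC", "Americana"), ("Song D", "GDCG", "Bluegrass")]
for title, prog, subgenre in rows:
    cur.execute(
        "INSERT INTO Song (title, chordProg_id, subgenre_id) VALUES (?, ?, ?)",
        (title, get_id("ChordProg", "prog", prog), get_id("Genre", "name", subgenre)))

def top_progressions(subgenre, limit=10):
    """Most frequent chord progressions for one subgenre, most common first."""
    cur.execute("""
        SELECT chord.prog, COUNT(*) AS n
        FROM Song
        JOIN Genre AS sub ON sub.id = Song.subgenre_id
        JOIN ChordProg AS chord ON chord.id = Song.chordProg_id
        WHERE sub.name = ?
        GROUP BY chord.prog
        ORDER BY n DESC, chord.prog
        LIMIT ?""", (subgenre, limit))
    return cur.fetchall()

print(top_progressions("Americana"))
```

Because each progression and genre name is stored once and referenced by id, data from another source only needs to go through the same `get_id`-style lookup to land in the same tables.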


We can see that many of the chord progressions used in country music include the chords G, C, and D. Notably, the most used chord progression, "GDCG", is used over 50% more often than the next most common chord progression. However, if we look at the variation of chord progressions, seen in the cumulative distribution function, country music does not appear to show any less variation than other genres.
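The cumulative distribution used for this comparison can be sketched as follows: sort the progression counts from most to least common and track the running fraction of all songs covered. This is an illustrative Python 3 sketch rather than the plotting code behind the figures; `cumulative_share` is a hypothetical helper and the counts for the two genres are made up.

```python
from collections import Counter
from itertools import accumulate

def cumulative_share(counts):
    """Sort counts in descending order and return the running fraction of the total."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]

# Hypothetical progression counts for two made-up genres
genre_a = Counter({"GDCG": 40, "DGDG": 25, "CGAmC": 20, "GCGC": 15})
genre_b = Counter({"GDCG": 70, "DGDG": 15, "CGAmC": 10, "GCGC": 5})
print(cumulative_share(genre_a.values()))
print(cumulative_share(genre_b.values()))
```

A curve that climbs towards 1 more quickly, as for the second made-up genre, means a few progressions dominate and the genre shows less variation.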

If we look at the count of each chord, i.e., the total number of instances of a particular chord across all the songs, this may tell us more. This can be done by aggregating and summing all the chords from all the tabs.

We find that chords involving variations of G, C, and D are very popular in country music, and those three chords account for about 45% of the total chords. The top 8 chords (G, C, D, A, F, Am, Em, E) account for almost 75%. For context, there are over 800 various chord combinations (see: here). So we can see how these 8 chords and their various combinations can lead to music that may sound similar.
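The "top chords account for X% of all chords" figure comes from pooling every chord from every tab and dividing the count of the most common ones by the total. A minimal Python 3 sketch of that calculation is below; `top_k_share` is a hypothetical helper and the three short tabs are made up, so the numbers it prints are not the results quoted above.

```python
from collections import Counter

def top_k_share(chord_counts, k):
    """Fraction of all chord instances covered by the k most common chords."""
    total = sum(chord_counts.values())
    top = sum(count for _, count in chord_counts.most_common(k))
    return top / total

# Made-up chords from three short tabs, pooled into one counter
tabs = [["G", "C", "D", "G"], ["G", "D", "Em", "C"], ["A", "D", "G", "F"]]
chord_counts = Counter(chord for tab in tabs for chord in tab)
print(chord_counts.most_common(3))
print(top_k_share(chord_counts, 3))
```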

When we compare the variation to other genres we can see that country music stands out with little variation in the chords used. From the previous cumulative distribution function we found that the variation of chord progressions is on par with other genres, but from this cumulative distribution function we find that the variation in actual chords is small. Country music uses the same chords in many different combinations. This may explain why many country music songs sound similar.


One issue I came across when scraping the website is that ultimate-guitar.com only allows 500 search results, yet there are over 20,000 5-star rated chord tabs for some of the genres. Separating by subgenre, or even by artist, may lead to a greater number of tabs to be processed, and so trends may become clearer. This 500-result limit is probably there for the user experience, since I'm assuming no one wants to search manually through 20,000 results. 500 is a happy medium that is small enough to search through manually, yet large enough to provide good and varied results.

    "Three chords and the truth - that's what a country song is"

Country Music Lyrics

Getting Country Music Lyrics

The hypothesis that country music sounds similar due to similarity of the lyrics can also be examined. The country music lyrics are scraped from the website www.anycountrymusiclyrics.com. The scraper goes through all the songs on the website and pulls the main text, which is the lyrics. The lyrics are then cleaned up by converting them to lowercase and removing punctuation and any extraneous spaces, and put into a list of lists: the outer list holds the songs, and each inner list contains the words of one song, together forming a complete corpus. The list is then saved using Python's pickle module, since the scraping takes about 35 minutes to run in total. The lyrics were in general easier to scrape than the guitar tablature, since lyrics are submitted in a more standard format than guitar tablature.
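The clean-up step described above (lowercase, strip punctuation, collapse whitespace, split into words) can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not the scraper's own helper: `clean_lyrics` is a hypothetical name and the sample text is made up.

```python
import string

def clean_lyrics(text):
    """Lowercase, drop ASCII punctuation and split on any run of whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

raw = "Well, I'm   outta here!\nForever, in the Caribbean sun..."
print(clean_lyrics(raw))
```

Note that `string.punctuation` only covers ASCII characters, so a typographic apostrophe (as in "ain’t") would survive this step unless it is added to the characters being removed.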

Python script - Country Lyric Webscraper

# -*- coding: utf-8 -*-
"""
Created on Sat Jul  2 16:49:11 2016

@author: matt-666
"""
import urllib
from bs4 import BeautifulSoup, SoupStrainer
import string
import re
import pickle

# = Helper Functions =========================================================
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element.encode('utf-8'))): # skip html comments
        return False
    return True

def strClean(strList):
    newRes2 = []
    for i in range(len(strList)-1):
        if strList[i+1]!='\n' and strList[i]!='\n':
            t1 = ' '.join(str(strList[i]).split()) # collapses whitespace, including any \n
            t1 = t1.translate(string.maketrans("",""), string.punctuation) # remove punctuation
            newRes2.extend(t1.lower().split()) # lowercase and split into words
    return newRes2

# ================================================================

# = Web scraper ==================================================

def main():
  baseurl = urllib.urlopen('http://www.anycountrymusiclyrics.com/').read()

  # = Get all links by artist first letter
  artistAlphaLinks = []
  for link in BeautifulSoup(baseurl, parseOnlyThese=SoupStrainer('a')):
      if link.has_attr('href') and '.com/artist' in link['href']:
          print link['href']
          artistAlphaLinks.append(link['href'])

  # = Get all links of artists
  artistLinks = []
  for alphaLink in artistAlphaLinks:
      newurl = urllib.urlopen(alphaLink).read()
      for link in BeautifulSoup(newurl, parseOnlyThese=SoupStrainer('a')):
          if link.has_attr('href') and '.com/show/artist' in link['href']:
              print link['href']
              artistLinks.append(link['href'])

  # = Get all links of songs
  songLinks = []
  for artistLink in artistLinks:
      newArtisturl = urllib.urlopen(artistLink).read()
      for link in BeautifulSoup(newArtisturl, parseOnlyThese=SoupStrainer('a')):
          if link.has_attr('href') and '.com/lyrics/' in link['href']:
              print link['href']
              songLinks.append(link['href'])

  # = Get the lyrics of all songs
  songLyrics = []
  for songLink in songLinks:
      songSoup = BeautifulSoup(urllib.urlopen(songLink).read())
      songData = songSoup.findAll(text=True)
      songResult = filter(visible, songData)
      songLyrics.append(strClean(songResult)) ## Extend -> one long list, append -> list length of songs

  # = dump in pickle file
  pickle.dump(songLyrics, open("songLyrics.p","wb"))

if __name__=="__main__":
  main()

The web scraper picks up 443 song lyrics in total. This seems like a low number only because that is the total number of lyrics on the website; the scraper picks up all of them, with a 100% success rate. The lyrics are then processed by sorting them into a dictionary. Specifically, I count the instances of each word and also create a nested dictionary: each key of the outer dictionary is a word, and its value is a dictionary whose keys are the words that follow it and whose values are the number of times each has occurred in the total corpus.

For example, if we look at the value for the key 'hotel', we get

{'room': 3, 'wall': 1, 'bed': 1, 'sign': 1, 'with': 1}

This means that of all the times the word 'hotel' appears in a lyric, 3 times it was followed by 'room', once by 'wall', once by 'bed', once by 'sign', and once by 'with'.
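A nested dictionary of this shape can be built by walking through each song and counting every pair of neighbouring words. The Python 3 sketch below shows the idea on two made-up "songs"; `build_followers` is a hypothetical name, and it uses `defaultdict(Counter)` as one convenient way to get the same structure.

```python
from collections import Counter, defaultdict

def build_followers(songs):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    for words in songs:
        # zip pairs each word with its successor within the same song
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers

# Two made-up songs, already cleaned and split into words
songs = [["hotel", "room", "with", "a", "hotel", "bed"],
         ["a", "hotel", "room"]]
followers = build_followers(songs)
print(dict(followers["hotel"]))
```

Pairs are only counted within a song, so the last word of one song is never treated as being followed by the first word of the next.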

This analysis can be used to compute n-grams, as well as for feature-engineering in naive-Bayes classification, for example to classify various genres of music, via their lyrics.

Python script - Country Lyric Plot

# -*- coding: utf-8 -*-
"""
Created on Sat Jul  2 16:49:11 2016

@author: matt-666
"""

import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import stopwords
import random
import pickle

# = Post-scrape ===============================================================

def main():
  # = grab from pickle file
  songLyrics = pickle.load(open("songLyrics.p", "rb"))

  # = compress into one long list
  flatSongLyrics = [item for sublist in songLyrics for item in sublist]

  # = remove stopwords
  flatSongLyrics_noStopwords = [word for word in flatSongLyrics if word not in (stopwords.words('english'))]

  # = the word 'chorus' remains, so we remove
  flatSongLyrics_noStopwords[:]  = [x for x in flatSongLyrics_noStopwords if x != 'chorus']

  #Count all instances of words that are not stop words
  wordCount = Counter(flatSongLyrics_noStopwords)

  # = sort dictionary by value
  sortedKeys, sortedVals = [], []
  for key, value in sorted(wordCount.iteritems(), key=lambda (k,v): (v,k)):
      sortedKeys.append(key)
      sortedVals.append(value)
  # = have to reverse since the sort is in ascending order
  sortedKeys.reverse(); sortedVals.reverse()

  # = Plot in bar chart
  plt.figure(44); plt.clf(); res = 40
  total = float(sum(sortedVals)) # a list cannot be divided directly, so scale each value
  plt.bar(range(res), [v/total*100. for v in sortedVals[:res]], align='center')
  plt.xticks(range(res), sortedKeys[:res], rotation = 60)
  plt.xlabel('Lyric'); plt.ylabel('Count (%)')
  plt.show()

if __name__=="__main__":
  main()


To better understand the context, all the stop words, taken from the Natural Language Toolkit (NLTK), have been removed, which gives a more accurate description of the language of country music. We find that just 40 words make up 22.48% of all lyrics, and only about 1000 words are needed to account for 80% of all words in country music. If we look at the kind of language in the top 40 words, almost all of them belong to heartbreak/love kinds of songs, or could be used in that context. This idea is supported when we look at the Top 500 country songs: many songs have titles such as "I Don't Call Him Daddy" by Doug Supernaw, or "We Danced" by Brad Paisley, and after looking at our lyrics analysis we might be able to infer what kind of lyrics these songs contain.
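A figure like "about 1000 words account for 80% of all words" can be read off the sorted word counts by adding them up from the most common word downwards until the target share is reached. The sketch below, in Python 3, shows that calculation on a made-up counter; `words_for_coverage` is a hypothetical helper and the counts are not from the scraped corpus.

```python
from collections import Counter

def words_for_coverage(word_counts, target):
    """Smallest number of top-ranked words whose counts reach `target` of the total."""
    total = sum(word_counts.values())
    running = 0
    for rank, (_, count) in enumerate(word_counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return rank
    return len(word_counts)

# Made-up counts with stop words already removed
counts = Counter({"love": 6, "heart": 3, "night": 1, "road": 1, "truck": 1})
print(words_for_coverage(counts, 0.8))
```

With these made-up counts there are 12 words in total, so 80% coverage needs a running count of at least 9.6, which is first reached with the third-ranked word.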

Generating Country Music Lyrics

We can take this approach further, and use a naive-Bayes approach to generate our own country music lyrics. Starting from a word, we can use the nested dictionary to see which words are likely to follow it. We continue this until we get one lyric, which consists of 8-10 words. Typically, a chorus or verse consists of 4-6 lyrics. From this we can create a song. This process is similar to a Markov chain, a random mathematical system in which the next state depends only on the current state, and not on the ones that came before it. Here, the predictive text generator only offers suggestions based on the last word entered. However, this process differs from a Markov chain in that the user has to enter a word to begin with, or a random one is picked from the dictionary.
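The word-by-word generation described above can be sketched as a weighted random walk over the nested dictionary: each next word is drawn with probability proportional to how often it followed the current word. This is a minimal Python 3 illustration, not the generator script itself; `next_word`, `make_lyric`, the restart rule for words with no recorded follower, and the tiny follower table are all assumptions made for the example.

```python
import random

def next_word(followers, word, rng):
    """Pick a follower of `word` with probability proportional to its count."""
    options = followers.get(word)
    if not options:  # dead end: restart from a random known word
        return rng.choice(sorted(followers))
    words = list(options)
    weights = [options[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def make_lyric(followers, start, length, rng):
    """Build one lyric of `length` words, each chosen from the previous word only."""
    lyric = [start]
    while len(lyric) < length:
        lyric.append(next_word(followers, lyric[-1], rng))
    return " ".join(lyric)

# Tiny made-up follower table in the same shape as the nested dictionary
followers = {
    "hotel": {"room": 3, "bed": 1},
    "room": {"with": 2},
    "with": {"a": 1},
    "a": {"hotel": 2, "bed": 1},
}
rng = random.Random(120)  # fixed seed so the same lyric is produced each run
print(make_lyric(followers, "hotel", 8, rng))
```

Passing a seeded `random.Random` instance keeps the output repeatable while leaving the global random state untouched; a verse is then just several such lyrics joined with line breaks.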

Python script - Country Song Generator

# -*- coding: utf-8 -*-
"""
Created on Sat Jul  3 19:49:21 2016

@author: matt-666
"""

import random
import pickle

# = Helper Functions =========================================================

def WeightedPick(d):
    r = random.uniform(0, sum(d.values()))
    s = 0.0
    for k, w in d.iteritems():
        s += w
        if r < s: return k
    return k

def createVerse(mydict, verseLen, lyricLen):
    robo_lyric = [random.choice(mydict.keys())] # initialize with random word
    for j in range(verseLen):
        for i in range(lyricLen):

    for i in range(verseLen):
    return(' '.join(robo_lyric))

# =============================================================================

# = Post-scrape ===============================================================

def main():
  # = grab from pickle file
  songLyrics = pickle.load(open("songLyrics.p", "rb"))

  # = Country-Robo lyric machine ===============================================
  # = Uses a naive-Bayes approach to choose the likely next word to create a lyric

  # = Create dictionary of dictionaries to see what words are likely to come after each one
  lyricDict= {}
  for song in songLyrics:
      for word in range(len(song)-1):
          try:
              lyricDict.setdefault(song[word], {})[song[word+1]] += 1
          except KeyError: # if key doesnt exist
              lyricDict.setdefault(song[word], {})[song[word+1]] = 1

  lyricLen = 10 # number of words in lyric
  chorusLen = 4  # number of lyrics in chorus/verse
  random.seed(120) # for repetibility, set alternate seed for different song

  chorus = createVerse(lyricDict, chorusLen,lyricLen)
  verse1 = createVerse(lyricDict, chorusLen,lyricLen)
  verse2 = createVerse(lyricDict, chorusLen,lyricLen)
  mySong = verse1 + '\n\r\n' + chorus + '\n\r\n' + verse2 + '\n\r\n' + chorus

if __name__=="__main__":
  main()

# Actual Lyric '27 seconds on a party bone is the party for me' XD XD XD

A typical song produced, with both chorus and verse having 4 lyrics and 10 words per lyric, in the format verse 1, chorus, verse 2, chorus:

 teen havin fun there are times there’s at school football
 team and there was was custommade to do nothing before
 anywhere that it feels its the things that lullaby
 of the chance to sing stuff we ain’t funny more try

 couldve lied but when i was thinking bout as long
 as i was in her name in love me that
 looks like a different then well ive been grand daddy
 ain’t muddy and shes a little moments like hes alright now

 check every day i think its funny well this worlds
 too much like there i’d like deja vu because when
 im outta here forever in some people take it to
 the second grade i was in the caribbean sun well that

 couldve lied but when i was thinking bout as long
 as i was in her name in love me that
 looks like a different then well ive been grand daddy
 ain’t muddy and shes a little moments like hes alright now

To improve on this, I have implemented a model trained using recurrent neural networks, which can be found in my blog post here, that more accurately recreates the grammatical syntax of country music lyrics.


To conclude, we have seen that while chord progressions in country music show as much variation as in other genres, there is significantly less variation in the actual chords used in those progressions, which I believe may be the reason that much country music sounds similar. Also, the language used in country music has little variation and the themes of the songs are similar, and the generated lyrics, grammar aside, are pretty convincing and subscribe to the successful "formula" that has dominated country music.

In the end, it's not just me who thinks all country music sounds the same: as Buzzfeed reports, songwriter Gregory Todd shows that different country songs, from different artists, can be mixed and the result can be seamless.