Python: Check Wiki-references for citation template

Wikipedia articles are full of references, and those references should all be formatted the same way. Using a template for citations is much easier than trying to guess the right way to cite. Unfortunately, most Wikipedia users don't know about Template:Citation.

So whenever I edit an article, I try to fix all manually styled citations. Doing this by hand is quite time-intensive, which is why I wrote a little Python script.

Examples

<ref>[http://peter.mapledesign.co.uk/weblog/archives/python-is-slow Python is... slow?] December 21st, 2004 &mdash; Peter Bowyer&rsquo;s weblog]</ref>

should be

<ref>{{Citation |url=http://peter.mapledesign.co.uk/weblog/archives/python-is-slow |title=Python is... slow? |accessdate=September 24, 2011}}</ref>

Similarly,

<ref>[http://www.nongnu.org/pydbc/ Contracts for Python], PyDBC</ref>

should be

<ref>{{Citation |url=http://www.nongnu.org/pydbc/ |title=Contracts for Python |accessdate=September 24, 2011}}</ref>

So, all the script has to do is:

  1. Find all <ref> tags without a template in an article
  2. Try to find the URL of each reference
  3. Fill out as much as possible for the user
  4. Ask the user for the missing information
  5. Return the new wiki source code of the article

Downloading wiki source code

Wikipedia offers an API for accessing the needed information. I will use this API together with Python's optparse, BeautifulSoup, and HTMLParser modules to get the raw wiki text in UTF-8 encoding:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from optparse import OptionParser
from BeautifulSoup import BeautifulStoneSoup
import HTMLParser

parser = OptionParser()
parser.add_option(
    "-l",
    "--lemma",
    dest="lemma",
    default="Python",
    type="string",
    help="Which lemma should be checked?",
)
parser.add_option(
    "-m",
    "--language",
    dest="language",
    default="en",
    type="string",
    help="Which langauge should be used " + "(english wiki, geman wiki, ... )",
)
parser.add_option(
    "-v",
    "--verbose",
    action="store_true",
    dest="verbose",
    default=False,
    help="Show more information.",
)
(options, args) = parser.parse_args()


def load(lemma, language="en", format="xml"):
    """ Get the Wikipedia Source Text (not the HTML source code)

        format:xml,json, ...
        language:en, de, ...

        Returns None if page doesn't exist
    """
    lemma = lemma.replace(" ", "_")
    url = (
        "http://"
        + language
        + ".wikipedia.org/w/api.php"
        + "?action=query&format="
        + format
        + "&prop=revisions&rvprop=content"
        + "&titles="
        + lemma
    )
    request = urllib2.Request(url)
    handle = urllib2.urlopen(request)
    text = handle.read()
    if format == "xml":
        soup = BeautifulStoneSoup(text)
        rev = soup.rev
        if rev is not None:
            text = unicode(rev.contents[0])
            text = HTMLParser.HTMLParser().unescape(text)
            text = text.encode("utf-8")
        else:
            return None
    return text
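
As a minimal usage sketch, this is how load() can be called. For the default lemma the function requests http://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvprop=content&titles=Python and returns the wiki text as UTF-8, or None if the page doesn't exist:

text = load("Python", language="en")
if text is None:
    print("The page does not exist.")
else:
    # Print the first 200 bytes of the raw wiki source
    print(text[:200])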

We are now able to access the needed information. Next, we need to find the references that don't use a template. To do so, I will use Python's re module:

import re


def getRef(page):
    """ Get all references without templates """
    pattern = "(<ref>\[.+?</ref>)"
    prog = re.compile(pattern)
    return prog.findall(page)
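
A quick sanity check with the second example reference from above (the surrounding article text is made up):

page = 'Text.<ref>[http://www.nongnu.org/pydbc/ Contracts for Python], PyDBC</ref> More text.'
print(getRef(page))
# ['<ref>[http://www.nongnu.org/pydbc/ Contracts for Python], PyDBC</ref>']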

Now each reference has to be parsed, and the user has to confirm or edit the result:

import readline
from datetime import date


def rlinput(prompt, prefill=""):
    """ Promt the user for input, but prefill it. """
    readline.set_startup_hook(lambda: readline.insert_text(prefill))
    try:
        return raw_input(prompt)
    finally:
        readline.set_startup_hook()


def improve(references, page):
    """ Try to guess the right formatting for each reference and ask
        the user to confirm or edit it. """

    urlPattern = "http.+? "
    urlPatternCompiled = re.compile(urlPattern)

    today = date.today()
    accessdate = str(today.strftime("%B %d, %Y"))

    for refOld in references:
        url = urlPatternCompiled.findall(refOld)
        if not url:
            # No URL found in this reference - skip it instead of crashing
            continue
        refNew = (
            "<ref>{{Citation "
            + "|url="
            + str(url[0])
            + "|title= "
            + "|accessdate="
            + accessdate
            + "}}</ref>"
        )

        refNew = rlinput(refOld + " (old):\n", refNew)
        page = page.replace(refOld, refNew)
        print("")

    return page
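
For illustration, here is what improve() prefills for the second example reference from above, assuming the script is run on September 24, 2011. Note that the URL pattern "http.+? " includes the trailing space of the match, which is where the space before |title= comes from:

references = ['<ref>[http://www.nongnu.org/pydbc/ Contracts for Python], PyDBC</ref>']
# rlinput() presents this editable suggestion to the user:
# <ref>{{Citation |url=http://www.nongnu.org/pydbc/ |title= |accessdate=September 24, 2011}}</ref>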

Here is the full script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib2
from optparse import OptionParser
from BeautifulSoup import BeautifulStoneSoup
import HTMLParser

import re

import readline
from datetime import date

parser = OptionParser()
parser.add_option(
    "-l",
    "--lemma",
    dest="lemma",
    default="Python",
    type="string",
    help="Which lemma should be checked?",
)
parser.add_option(
    "-f",
    "--file",
    dest="filename",
    help="write corrected wiki source code to FILE",
    metavar="FILE",
    default="wikioutput.txt",
)
parser.add_option(
    "-m",
    "--language",
    dest="language",
    default="en",
    type="string",
    help="Which langauge should be used " + "(english wiki, geman wiki, ... )",
)
parser.add_option(
    "-v",
    "--verbose",
    action="store_true",
    dest="verbose",
    default=False,
    help="Show more information.",
)
(options, args) = parser.parse_args()


def load(lemma, language="en", format="xml"):
    """ Get the Wikipedia Source Text (not the HTML source code)

        format:xml,json, ...
        language:en, de, ...

        Returns None if page doesn't exist
    """
    lemma = lemma.replace(" ", "_")
    url = (
        "http://"
        + language
        + ".wikipedia.org/w/api.php"
        + "?action=query&format="
        + format
        + "&prop=revisions&rvprop=content"
        + "&titles="
        + lemma
    )
    request = urllib2.Request(url)
    handle = urllib2.urlopen(request)
    text = handle.read()
    if format == "xml":
        soup = BeautifulStoneSoup(text)
        rev = soup.rev
        if rev is not None:
            text = unicode(rev.contents[0])
            text = HTMLParser.HTMLParser().unescape(text)
            text = text.encode("utf-8")
        else:
            return None
    return text


def getRef(page):
    """ Get all references without templates """
    pattern = "(<ref>\[.+?</ref>)"
    prog = re.compile(pattern)
    return prog.findall(page)


def rlinput(prompt, prefill=""):
    """ Promt the user for input, but prefill it. """
    readline.set_startup_hook(lambda: readline.insert_text(prefill))
    try:
        return raw_input(prompt)
    finally:
        readline.set_startup_hook()


def improve(references, page):
    """ Try to guess the right formatting for each reference and ask
        the user to confirm or edit it. """

    urlPattern = "http.+? "
    urlPatternCompiled = re.compile(urlPattern)

    today = date.today()
    accessdate = str(today.strftime("%B %d, %Y"))

    for refOld in references:
        url = urlPatternCompiled.findall(refOld)
        if not url:
            # No URL found in this reference - skip it instead of crashing
            continue
        refNew = (
            "<ref>{{Citation "
            + "|url="
            + str(url[0])
            + "|title= "
            + "|accessdate="
            + accessdate
            + "}}</ref>"
        )

        refNew = rlinput(refOld + " (old):\n", refNew)
        page = page.replace(refOld, refNew)
        print("")

    return page


if __name__ == "__main__":
    print(
        "If you need more parameters like 'date': "
        + "http://en.wikipedia.org/wiki/Template:Citation#"
        + "Full_citation_parameters"
    )
    page = load(options.lemma, options.language)
    if page is None:
        print("The page '%s' does not exist." % options.lemma)
    else:
        references = getRef(page)
        page = improve(references, page)
        with open(options.filename, "w") as f:
            f.write(page)
        print("Page has been written to %s." % options.filename)

This can be improved in several ways:

  • Checking the title automatically and detecting dead links (a sketch follows below)
  • Trying to find the publication date automatically
  • Skipping links that already carry Template:Dead link
  • Searching also for Text
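
As a sketch of the first point: the title and dead links could be detected with one HTTP request per URL. This is illustration code under my own assumptions, not part of the original script; the helper name getTitle is made up:

import re
import urllib2


def getTitle(url):
    """ Hypothetical helper: fetch a URL and return its <title>,
        or None if the link appears to be dead. """
    try:
        html = urllib2.urlopen(url, timeout=10).read()
    except (urllib2.URLError, IOError):
        # The request failed, so the link is a candidate for Template:Dead link
        return None
    match = re.search("<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

improve() could use such a helper to prefill the |title= parameter and to skip or mark dead links.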

Here is the Wikipedia diff page. My little script seems to work.

