Analyzing PyPI Metadata

This is part one of a series. See Analyzing PyPI Data - 2 for part two.

PyPI, the Python Package Index, gives a very crappy but simple interface to query metadata about its packages. I scrapped all of the packages metadata. 53,533 packages were scrapped (date: 2015-01-18), because I wanted to see if there is malware on PyPI (related to this question on security.SE).

The database looks like this:

PyPI metadata database schema — PyPI metadata

Exploring the data

When I scapped the data from PyPI, I made all database fields "varchar 255" as there seems to be no information about the possible values. Then I explored the data

name: The longest package name is 80 characters long (Aaaaaaaaaaa...), the shortest packages have only one character.
author: 911× "UNKOWN", 741× empty, 195× "None", 151× "Zope Foundation and Contributors". There are about 22,000 different authors. There are 595 authors who wrote more than 10 packages.
author_email: 2059× "UNKOWN", 932× empty, 323× "[email protected]", 216× "None" and 204× "TODO".
maintainer: 46879× "None", 5507× empty, 17× Paul Boddie.
requires_python: 53467× "None", 3× "UNKNOWN", 2× ">=2.7,!=3.0,!=3.1", 2× ">=2.5", 1× "2.6", 1× ">= 3.3", 1× ">=2.4", 1× ">= 2.7" - that's it. No other values. Seems as if this is pretty much useless.
docs_url: Either begins with http://pythonhosted.org or is empty.

Most active authors

SELECT
    `author`, COUNT(`id`) as `created_packages`
FROM
    `packages`
GROUP BY
    `author`
ORDER BY
    COUNT(`id`) DESC, `author` ASC
LIMIT
    50

gives

	1485
UNKNOWN	1150
None	196
OpenStack	188
Zope Foundation and Contributors	146
MicroPython Developers	139
OpenERP SA	137
Zope Corporation and Contributors	128
RedTurtle Technology	127
Praekelt Foundation	126
Fanstatic Developers	98
Tryton	98
Raptus AG	97
russianidiot	93
hfpython	92
LOGILAB S.A. (Paris, FRANCE)	88
BlueDynamics Alliance	87
Ralph Bean	83
JeanMichel FRANCOIS aka toutpt	69
Bart Thate	68

The total number of authors is 28 183 (06.12.2015):

SELECT COUNT(DISTINCT `author`) AS `total_authors` FROM `packages`

Maximum Length

I guess many values are stored on PyPI as Varchar(255). To check if I might have missed some values, I checked which entry was the longest one for all columns. I did this with

SELECT `name`, LENGTH(`name`) FROM `packages` ORDER BY LENGTH(`name`) DESC

Column	Maximum Length	Entry
name	80	Aaaaaaaaaaaaaaaaaaa-aaaaaaaaa-aaaaaaasa-aaaaaaasa-aaaaasaa-aaaaaaasa-bbbbbbbbbbb
author	248	Michael R. Crusoe, Greg Edvenson, Jordan Fish, Adina Howe, Luiz Irber, Eric McDonald, Joshua Nahum, Kaben Nanlohy, Humberto Ortiz-Zuazaga, Jason Pell, Jared Simpson, Camille Scott, Ramakrishnan Rajaram Srinivasan, Qingpeng Zhang, and C. Titus Brown
author_email	215	[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
maintainer	63	Panagiotis Liakos, Katia Papakonstantinopoulou, Michael Sioutis
maintainer_email	67	[email protected], [email protected], [email protected]
requires_python	17	>=2.7,!=3.0,!=3.1
platform	231	Natural Language :: English License :: OSI Approved :: Apache Software License Environment :: Console Development Status :: 2 - Pre-Alpha Intended Audience :: Developers Programming Language :: Python :: 2.7 Topic :: Database
version	73	[In Progress] Python middle layer for interacting with Redis data easily.
license	466	Copyright © 2014 ЗАО “БАРС Груп” Данная лицензия разрешает лицам, получившим копию данного программного обеспечения и сопутствующей документации (в дальнейшем именуемыми «Программное Обеспечение»), безвозмездно использовать Программное Обеспечение без ог
keywords	655	ArcLink,array,array analysis,ASC,beachball,beamforming,cross correlation,database,dataless,Dataless SEED,datamark,earthquakes,Earthworm,EIDA,envelope,events,FDSN,features,filter,focal mechanism,GSE1,GSE2,hob,iapsei-tau,imaging,instrument correction,instrument simulation,IRIS,magnitude,MiniSEED,misfit,mopad,MSEED,NERA,NERIES,observatory,ORFEUS,picker,processing,PQLX,Q,real time,realtime,RESP,response file,RT,SAC,SEED,SeedLink,SEG-2,SEG Y,SEISAN,SeisHub,Seismic Handler,seismology,seismogram,seismograms,signal,slink,spectrogram,StationXML,taper,taup,travel time,trigger,VERCE,WAV,waveform,WaveServer,WaveServerV,WebDC,web service,Winston,XML-SEED,XSEED
description	65535	[id don't want to put that here - there were 22 with this length]
summary	278	Splits one vcard file (*.vcf) to many vcard files with one vcard per file. Useful for import contacts to phones, thats do not support multiple vcard in one file. Supprt unicode cyrillic characters. Консольная программа для р
stable_version	4	None
home_page	134	http://127.0.0.1:8888/USK@9X7bw5HD2ufYvJuL3qAVsYZb3KbI9~FyRu68zsw5HVg,lhHkYYluqHi7BcW1UHoVAMcRX7E5FaZjWCOruTspwQQ,AQACAAE/pyfcp-api/0/
release_url	130	http://pypi.python.org/pypi/softwarefabrica.django.appserver/1.0dev-BZR-r10-panta-elasticworld.org-20091023132843-vitk6k7e5qlvhej5
bugtrack_url	104	https://bugzilla.redhat.com/buglist.cgi?submit&component=python-nss&product=Fedora&classification=Fedora
download_url	183	http://pypi.python.org/packages/source/s/softwarefabrica.django.appserver/softwarefabrica.django.appserver-1.0dev-BZR-r10-panta-elasticworld.org-20091023132843-vitk6k7e5qlvhej5.tar.gz
package_url	108	http://pypi.python.org/pypi/Aaaaaaaaaaaaaaaaaaa-aaaaaaaaa-aaaaaaasa-aaaaaaasa-aaaaasaa-aaaaaaasa-bbbbbbbbbbb
_pypi_hidden	-	True or False
_pypi_ordering	-	Integers from 0 to 464
cheesecake_code_kwalitee_id	-	Integers from 183 to 6513 and None values
cheesecake_documentation_id	-	Integers from 182 to 6512 and None values
cheesecake_installability_id	-	Integers from 181 to 6511 and None values

We can see multiple problems here:

URLs: localhost / 127.0.0.1 is almost certainly not desired
cheesecake_installability_id, cheesecake_documentation_id, cheesecake_code_kwalitee_id should have NULL and None should be casted to NULL. The data type is likely to be numeric.
_pypi_ordering should be numeric
_pypi_hidden should be boolean

So I changed the types in the database. Let's continue.

What people don't use

All metadata is provided by the authors of the packages. Some fields, like the package name or the description, are used very often, but some fields are only rarely used:

Maintainer

maintainer and maintainer_email is interesting for people who want to send bug reports. If this is empty, I would guess the package is dead.

Platform

SELECT
    `platform`, COUNT(`id`)
FROM
    `packages`
GROUP BY
    `platform`
ORDER BY
    COUNT(`id`) DESC

gives 759 different results. The TOP-10 were

Platform	Count
UNKNOWN	43539
any	3515
	1986
Any	635
OS Independent	542
None	265
Linux	225
linux	167
Posix; MacOS X; Windows	157
POSIX	156

You can see that there are alternative ways to express the same thing. Also, the ";" should not be here as the manual states it should be a list.

The manual is also too short for this entry. It only says "a list of platforms" which seems to be pretty much useless.

docstore.mik.ua gives a little bit more information

A list of platforms on which this distribution is known to work. You should provide this information if you have reasons to believe this distribution may not work everywhere. This information should be reasonably concise, so this field may refer for details to a file in the distribution or to a URL.

Bugtracker

This one is very important. Users should have an easy possibility to report bugs. So please help them by added your bug tracker URL wherever it makes sense. Here is how you add it on PyPI:

License

Another important one. Add a license to your software!

SELECT
    `license`, COUNT(`id`)
FROM
    packages
GROUP BY
    `license`
ORDER BY
    COUNT(`id`) DESC

gives

License	Count
UNKNOWN	11444
MIT	8098
BSD	7224
GPL	4202
MIT License	1410
(empty)	1398
LICENSE.txt	1201
GPLv3	1083
LGPL	833
ZPL 2.1	829
BSD License	811
Apache License 2.0	482
Apache License, Version 2.0	443
Apache 2.0	437
GPLv2	370
None	363
ZPL	288
GPL-3	272
GPLv3+	258
LICENSE	249

and about 6500 other licenses. Most might be variants in writing, e.g.

Apache License 2.0
Apache License, Version 2.0
Apache 2.0

are all the same license. Some packages have no licence (Unknown and empty). Many are invalid values, like

LICENSE.txt
None
LICENSE
Python

These indicate that people have no idea what to input there. tldrlegal.com might help in that case. I think the Python community should try to eliminate variants in writing license names as it makes finding, filtering and analyzing packages more difficult.

Classifiers

Python makes use of so called trove classifiers. They are defined in PEP 301 and listed here.

The following table gives the TOP-10 most commonly used classifiers:

Classifier	Percentage
Programming Language :: Python	0.4615
Intended Audience :: Developers	0.4412
Operating System :: OS Independent	0.3229
Programming Language :: Python :: 2.7	0.2183
Topic :: Software Development :: Libraries :: Pyth...	0.2038
Development Status :: 4 - Beta	0.2038
License :: OSI Approved :: BSD License	0.1632
License :: OSI Approved :: MIT License	0.1529
Environment :: Web Environment	0.1487
Programming Language :: Python :: 2.6	0.1306

When you analyze the licenses with the trove classifiers you get a different image:

SELECT
    `classifier`, COUNT(`id`) / 53533
FROM
    `package_classifiers`
WHERE
    `classifier` LIKE "License%"
GROUP BY
    `classifier`
ORDER BY
    COUNT(`id`) DESC

License	Percentage
License :: OSI Approved :: BSD License	0.1632
License :: OSI Approved :: MIT License	0.1529
License :: OSI Approved :: GNU General Public License (GPL)	0.0625
License :: OSI Approved :: Apache Software License	0.0460
License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)	0.0188

There are 94 trove classifiers with five or less packages which use this classifier. I guess many of them are not in the official list of classifiers.

Package type

SELECT
    `packagetype`, COUNT(`id`)
FROM
    `urls`
GROUP BY
    packagetype
ORDER BY
    COUNT(`id`) DESC

gives

Package type	Count
sdist	47649
bdist_egg	4933
bdist_wheel	3098
bdist_wininst	1512
bdist_dumb	577
bdist_msi	45
bdist_rpm	35
bdist_dmg	4

Can anybody explain what this means?

Downloads

SELECT
    `name`, `url`, `downloads`
FROM
    `urls`
JOIN
    `packages` ON `urls`.`package_id` = `packages`.`id`
ORDER BY
    `urls`.`downloads`  DESC
LIMIT 10

gives

Package type	Count
wincertstore	10026403
ssl	8987455
pyasn1	8655361
Paste	8401111
PyYAML	7180547
distribute	6276421
MarkupSafe	6215382
ecdsa	6030395
meld3	5715687
pika	5567400

and

SELECT
    `name`, SUM(`downloads`)
FROM
    `releases`
JOIN
    `packages` ON `releases`.`package_id` = `packages`.`id`
GROUP BY
    `name`
ORDER BY
    SUM(`downloads`)  DESC
LIMIT 10

Package type	Count
setuptools	45485205
requests	35446321
virtualenv	35039299
distribute	34779943
boto	29678066
six	28253705
certifi	27381407
pip	26266325
wincertstore	24831145
lxml	20901298

Size

What is the biggest Python package?

SELECT
    `name`, `release_number`, `size`
FROM
    `releases`
JOIN
    `packages` ON `releases`.`package_id` = `packages`.`id`
ORDER BY
    `releases`.`size` DESC
LIMIT 30

Package	Version	Size
de422	2009.1	545298406
de406	1997.1	178260546
scipy	0.13.3	62517637
appdynamics-bindeps-linux-x86	5913-master	57861536
appdynamics-bindeps-linux-x64	5913-master	56366211
python-qt5	0.1.5	56259023
python-qt5	0.1.8	56237972
pycalculix	0.92	56039839
wltp	0.0.9-alpha.3	55414544
cefpython3	31.2	55163815

Code

See github.com/MartinThoma/algorithms.

Further Ideas

Build a dependency graph: Some of the code was already written. However, one has to download about 25 GB of data, extract it and run over those files. This is quite a bit of work.
Analyze package quality
- Missing requirements
- Missing metadata / description
- Missing documentation
- PEP8
- Code duplication
Malicious package search:
Check which package names are prefixes of other package names.
Find packages which upload data (dropbox?)
Find pacakges which remove data from your file system

Analyzing PyPI Metadata

Exploring the data

Most active authors

Maximum Length

What people don't use

Maintainer

Platform

Bugtracker

License

Classifiers

Package type

Downloads

Size

Code

Further Ideas

Published

Category

Tags

Contact

Exploring the data

Most active authors

Maximum Length

What people don't use

Maintainer

Platform

Bugtracker

License

Classifiers

Package type

Downloads

Size

Code

Related

Further Ideas

Published

Category

Tags

Contact