The Impoverished Social Scientist's Guide to Free Statistical Software and Resources

Last Updated: December 18, 2008

Table of Contents

General Statistics Packages
Accurate Statistics
Data Interactive Graphics (Data Visualization)
Data Plotting (and Publication Ready Graphics)
Image and Plot Analysis
Data Mining
Qualitative Data Text Manipulation, Management, Mining
Spatial Statistics and GIS
Survey Data Collection and Analysis
Agent Based Simulation
Dynamic Event Simulation
Monte Carlo and Markov Chain Monte Carlo (MCMC) Simulation
Specialized Statistical Packages
Epidemiology
Data Cleaning and Management
Matrix Algebra, Symbolic Algebra, and Computational Algebra Systems
Social Network Analysis
Differential Equations and Dynamic Simulation
Machine Learning

Free/Open Source Software

For statistical computing resources and other software for accurate computing, such as high-precision libraries, optimizers, and random number generators, see our statistical computing page. For software written by me for data distribution, accuracy, and replication, see my software page. For sources of research data, see my Data Resources page.

Where to Start

The R Statistical Language

The open source statistical language of choice for most tasks. Based on the 'S' language. Thousands of contributed packages.

GPL

Other General Statistics Packages

ADE A modular multi-variate analysis program which includes modules for spatial data analysis. Plays well with R.

GPL

Adamsoft A general purpose package that specializes in client-server based data management and large-data/low-memory computations. Good for large datasets.

GPL

DataPlot A powerful, but somewhat byzantine package from the National Institute of Standards and Technology.

OSS

Gretl An open source econometrics package that plays nicely with R

GPL

ExaStat Basic statistics and regression on large data, using Windows.

OSS

Macanova Reasonably powerful & programmable, if not easy to use.

GPL

OpenStat General package focusing on teaching, IRT. OSS

PSPP Aspires to replace SPSS. Reads SPSS files and provides the data manipulation functions, but is missing most of the analytical features.

GPL

Simfit Reasonably powerful, with emphasis on simulation; command-line.

OSS

WinIADAMS A free Windows package for exploratory analysis, time series, and linear models. Nice interactive multi-dimensional table browser and interactive plots.

No Source

Accurate Statistics

(The following modules for R are very useful for highly accurate statistical computing on hard problems. For more resources and computing libraries, see my Resources for Accurate Computing page.)

accuracy Sensitivity analysis and true random number generation GPL

gmp Multiple precision arithmetic GPL

OpenTURNS Tools for modeling uncertainty and risks. GPL

rgenoud Optimizer using genetic algorithms and derivatives GPL

rstream Parallelizable random number generators GPL

trust Trust region based optimization GPL

UNF Universal Numeric Fingerprints -- format independent data validation.

GPL
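To see why accuracy-oriented tools like those above matter, consider the sample variance of data with a large mean and a small spread. The sketch below is illustrative only: it uses Python's standard `fractions` module as a stand-in for multiple-precision arithmetic, it does not use the API of any package listed here, and the function names and data are made up for the example.

```python
from fractions import Fraction


def naive_variance(xs):
    """One-pass 'textbook' formula: (sum of squares - (sum)^2 / n) / (n - 1).

    Subtracting two huge, nearly equal numbers loses most significant digits.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def exact_variance(xs):
    """Reference value computed in exact rational arithmetic."""
    fx = [Fraction(x) for x in xs]
    n = len(fx)
    mean = sum(fx) / n
    return sum((x - mean) ** 2 for x in fx) / (n - 1)


# Hypothetical data: a large offset plus small deviations.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print("exact    :", float(exact_variance(data)))
print("two-pass :", two_pass_variance(data))
print("naive    :", naive_variance(data))
```

The exact and two-pass results agree, while the one-pass formula is badly wrong in double precision because the squares exceed the range in which integers are represented exactly. Higher-precision arithmetic or a more careful algorithm avoids the problem.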

Data-Interactive Graphics (Data Visualization)

Also see the plotting category.

Gauguin Grouping, glyphs, tableplots, oh my. GPL

GGobi Supports data interactive visualization, exploration, comp, and analysis. Includes automated projection pursuit in high-dimensions.

GPL

Improvise A Java toolkit for linked visualizations. GPL

KLIMT Interactive analysis of classification and regression trees GPL

LabPlot Data analysis and visualization GPL

Mondrian Mondrian is especially useful for interactive visualization of categorical data, and very large datasets.

No Source

OPEN DX Generates visualizations and animations for very large scale scientific data

OSS

ParaView Parallel visualization of large datasets. GPL

prefuse Java visualization toolkit OSS

Processing A language for rapid development of interactive data visualizations. Well integrated with Java and can produce polished visualizations.

OSS

VISIT Parallel large data visualization software

VISTA Dynamic, interactive, multi-view graphics. Plus a very interesting visual user-interface, akin to data-desk, but more advanced statistically.

GPL

Data Plotting (and publication-ready graphics)

Almost all of the tools listed on this page have some sort of graphing capabilities. These packages specialize in it. Also see the visualization category.

Gnuplot Command-line driven plots in 2D and 3D. GPL

GUPPI Extensible plotting tool for Gnome. GPL

Jas3 A visualization and curve fitting package in java. GPL

SciGraphica High performance plotting package similar to Microcal Origin.

GPL

Image and Plot Analysis

These packages can be used to manipulate images and to extract quantitative information from them, including recovering data from published plots and graphs.

DataScan Extracts information from topographic images, microscopic images, and others.

OSS

g3data Specifically for extracting data from published graphs. GPL

Image/J Can extract data from scanned maps, charts, graphs and even photos.

OSS

Scion Image Programmable image program with data capture capabilities.

No source

Data Mining

Also see the categories on text mining and machine learning.

Auton Labs Software

Dozens of independent packages for machine learning, including many classifiers.

Source Available (registration required)

Databionic Clustering, visualization, and classification using emergent self-organizing maps.

GPL

Knime Supports data pipelines for data processing, clustering, supervised learning, etc. GUI, CLI and API based.

OSS

Orange Predictive modeling, ensemble methods, clustering and validation, using C components, GUI widgets, and Python integration.

GPL

Rattle A Gnome-based interface that glues together a large number of (clustering, association, machine learning, evaluation) modules in R for data mining.

GPL

Shogun Machine learning toolbox with multiple SVM, LDA, and LPM classifiers. C++ with interfaces for Octave, R, Matlab, and Python.

GPL

Tanagra Supports data processing streams including clustering, supervised learning, meta-spv, and cross-validation. Provides a GUI.

OSS

Qualitative Data Manipulation, Management, Mining and Analysis

A list of commercial and non-commercial tools for qualitative analysis is part of the Open Directory Project, a well-subscribed discussion list about software can be found as part of JISC, and a comparison of QDAS packages is here. The Natural Language Processing TaskView describes many R packages (interfaces to external toolkits) for text understanding. The ML-Interfaces package on BioConductor provides a uniform interface to a large set of machine learning packages in R.

Advene Video annotation OSS

AnSWR From the CDC, for mixed qualitative/quantitative analysis.

No Source

Automap/ORA Text tagging (similar to Atlas-TI), with more linguistic coding options, plus visualization and analysis of the network of concepts identified.

No Source

Elan For complex annotation of audio and video. GPL

EZ-Text From the CDC, for textual data analysis. No Source

Gate A toolkit for information extraction from text. GPL

Judge Performs automatic classification and clustering of documents.

GPL

Lingpipe Java library for linguistic processing and analysis. No Commercial Use

Kea Performs automated key phrase extraction. GPL

Language Archiving Technology

A hosted service for text management and analysis. Hosted

NLTK A Python toolkit for natural language processing. Includes tutorials on NLP.

GPL

Perl The programming language for supreme text mangling. OSS

Pliny For annotating documents, text and images, and generating maps and graphs of relationships.

OSS

SIL tools If you have a lot of text on-line, the concordance, indexing, and database tools from the Summer Institute of Linguistics may be what you need.

No Source

Tabari Uses special-purpose rules for categorizing news events from news text.

GPL

Tams Textual analysis and markup. Similar to Atlas-TI. GPL

TextStat Another indexing/concordance package. GPL

VUE Visual understanding environment. Allows you to create annotated networks of multimedia objects for presentation and commentary. A sort of non-linear, scholarly, PowerPoint.

OSS

Weft For qualitative data management and coding. No Source

Weka Weka is a collection of machine learning algorithms for data mining, including text mining. (R-Weka connects Weka and R, and is available on CRAN).

GPL

Wordfish Scaling software for estimating political positions from texts.

GPL

YALE (now RapidMiner)

A flexible standalone package that contains many data mining algorithms.

GPL

Spatial Statistics and GIS

In addition to the individual packages below, the Free GIS Site and OpenSourceGis sites maintain lists of many open-source GIS packages. The CSISS Tools Clearinghouse maintains links to many spatial analysis programs. Kelley Pace gives a list of links to software for advanced spatiotemporal econometrics. The AI-geostats software page has links to geo-spatial statistics programs and code. And Rgeo lists lots of contributed packages for doing geospatial statistics with R, including 'fields', 'geoR', 'graper', 'grass', and 'spatstat'.

Choroware Choropleth maps with genetic-algorithm-generated class intervals.

GPL

CrimeStat Network, spatial and statistical analysis for crime data. Created for the National Institute of Justice.

No Source

Fragstats Designed to compute a wide variety of landscape metrics for categorical map patterns

GPL

Geoda Unusual in its combination of GIS and spatial econometrics.

No Source

Geovista Studio General GIS toolkit and exploratory data analysis system. GPL

Grass One of the most powerful free geographic information systems for the display of spatial data.

GPL

LandSerf Land surface visualization and analysis No Source

SatScan Space-time scan statistics -- for analysis of disease and other clusters distributed in space and time

No Source

SAGA Combines GIS with kriging and terrain analysis GPL

Spatial Econometrics Lib.

A library of Matlab functions for advanced spatial, and spatiotemporal econometric analysis

OSS

STARS Space-time analysis of regional systems. Designed for the dynamic exploratory analysis of data measured for areal units at multiple points in time. If you have spatial time-series data, check this.

GPL

Survey Data Collection and Analysis

The general software packages above have some facilities for survey analysis. The programs below specialize in data collection and/or the analysis of complex surveys. Also see the Epidemiology section.

AM Handles analysis of complex survey samples, such as NAEP and TIMSS.

No Source

dopoxtools Free research web survey hosting Hosted

Mod_survey A very mature open source survey system, implemented as a drop-in Apache module. It supports creation of survey templates using XML, and export of the resulting data in a number of interchange formats. Mod_survey can be configured in a decentralized way, so that all users on a particular web server can administer their own surveys independently. (Also see YaaCs, below.)

GPL

OpenSurveyPilot Server based web survey system GPL

PHPEsp PHP based web survey system GPL

Lime Survey PHP based web survey system GPL

PEBL A programming environment for building interactive psychology experiments

GPL

protogenie Free research web survey hosting Hosted

PsychExps A repository of experimental design scripts to be run under the Macromedia Authorware environment.

Mixed

Quex Suite Web-based CATI system with integrated VoIP (Asterisk), XML form language, and paper form scanning capability.

GPL

SurveyWiz Simple JavaScript based web survey system GPL

TESS Time-Sharing Experiments for the Social Sciences. An NSF-funded infrastructure to provide both web and phone surveys.

Hosted

WebExp2 A Java-based system for on-line psych experiments. No Source

YaaCs A CATI system that uses Mod_survey for the data collection, and offers additional management of other phases of the survey work flow -- questionnaire building, interviewer management, etc.

GPL

Agent-Based Simulation

The International Society for Artificial Life maintains a list of links to many agent-based simulation frameworks.

Ascape Agent based simulation package GPL

breve Simulation in a 3-D world, using Python or a simple scripting language.

GPL

EVO A simulation environment for co-evolution, based on SWARM

OSS

MASON A java-based agent-based modeling system popular in political science

OSS

NetLogo An updated dialect of the Logo language for multi-agent simulation

No Source

REPAST A multi agent simulation toolkit, with multiple implementations and built in adaptive features

OSS

Sesam Simulation system with cool visual model building interface.

OSS

SOAR

Agent based modeling based on cognitive/AI constructs. GPL

Swarm A mature, full-featured framework for agent-based modeling, built in Objective C

GPL

Dynamic Event Simulation

This overlaps with Agent-Based Simulation above. I have listed only packages below, but several programming libraries are also available, including DSOL (Java), SimPy (Python), Adevs (C++) and DeX (Python, C++, Scripting).

Desmo-J Discrete event simulation framework GPL

OMNet++ OMNeT++ is a component-based, modular and open-architecture simulation environment with strong GUI support and an embeddable simulation kernel, focussing on communication networks, but general enough to be used for network, systems, and business process simulation.

Academic Source License (not open source)

Monte Carlo and Markov-Chain Monte Carlo (MCMC) Simulation

R and many of the other general packages above can be used for MC simulation. R also has a number of modules to perform Bayesian MCMC analysis, both directly and by communicating with BUGS and JAGS.

JAGS Just Another Gibbs Sampler. A program for Bayesian hierarchical models. ("Not unlike BUGS")

GPL

MCMCpack An R module to perform MCMC based analysis. Very easy to use, since it contains a large variety of pre-configured models

GPL

McSim A specially tailored Monte Carlo simulation package. Goes well beyond general packages.

GPL

OpenBugs Open source rewrite of BUGS for Bayesian simulation. GPL

WinBUGS Still the best BUGS for Windows, but not OSS. No Source
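As a rough illustration of the plain Monte Carlo simulation mentioned at the top of this section, the sketch below estimates how often a normal-approximation 95% confidence interval for a mean covers the true value in small samples. It uses only the Python standard library rather than any package listed above; the function name, sample size, replication count, and seed are all arbitrary assumptions chosen for the example.

```python
import random
import statistics


def simulate_coverage(n_obs=30, n_reps=2000, true_mean=0.0, true_sd=1.0, seed=12345):
    """Estimate the coverage rate of the interval mean +/- 1.96 * standard error.

    Each replication draws a fresh normal sample, builds the interval, and
    records whether it contains the true mean.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    covered = 0
    for _ in range(n_reps):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n_obs)]
        mean = statistics.fmean(sample)
        std_err = statistics.stdev(sample) / n_obs ** 0.5
        lower = mean - 1.96 * std_err
        upper = mean + 1.96 * std_err
        if lower <= true_mean <= upper:
            covered += 1
    return covered / n_reps


coverage = simulate_coverage()
print(f"Estimated coverage of the nominal 95% interval: {coverage:.3f}")
```

The estimated rate can be compared with the nominal 95%. Increasing `n_reps` reduces the simulation error of the estimate, and changing `n_obs` shows how the approximation behaves at other sample sizes.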

Specialized Statistical Packages

Blossom multi-response permutation tests No Source

Fityk Nonlinear peak fitting. GPL

Gambit game theory made simple(r) OSS

gSwing Election result tracking and display GPL

M.D. Anderson Cancer Center

Has useful biostat software from the biostats department. Mixed.

MDSX Multidimensional Scaling Routines for Windows No Source

MPCA Discrete and independent component analysis. GPL

MX Structural Equation Modeling (like LISREL) No Source

PAST PAlaeontological STatistics. Not strictly social science, of course, but the correspondence analysis, geometric analysis and cladistics could be applied fruitfully.

No Source

Sitkis Computes common bibliometric network statistics. No Source

Permap Perceptual maps created through interactive multidimensional scaling.

No Source

TETRAD A LISREL like structural equation modeling program GPL

TDA Transition Data Analysis. A system for analyzing event data; supports lots of options and models.

GPL

Voteview Voteview and nominate are for viewing and analyzing roll-call voting.

GPL

Epidemiology

The CDC Software Page also offers a set of special packages for sampling design factors, meta-analysis, and spatial analysis. Also see the WWW Virtual Epidemiology Library and the category on survey tools.

MIX Guided interactive meta-analysis. GPL

Epidata Provides for programmed data entry and simple analysis. No source.

Epigrass Epigrass is software for visualizing, analyzing, and simulating epidemic processes on geo-referenced networks.

GPL

Epi-info Epidemiological statistics, maps, reports. No Source

Openepi JavaScript-based (on or off-line) simple epidemiological statistics.

OSS

Netepi Web based secure data entry and analysis for epidemiology.

GPL

WinPepi Over 75 modules for common epidemiological methods. No Source

Data Cleaning, and Management

For managing qualitative data, see the Text Tools section. For other database options see the Free SQL List and the ACM's Sigmod List.

Berkeley DB A fast key-value based DB. Very lightweight (much more lightweight than SQL, and does not require a separate server to be running). Very fast for key-based retrievals. Also see the filehash and R.huge packages for using key-value DBs in R.

OSS

CCOUNT Does data cleaning, advanced cross-tabulation, and other market research functions. Also reads many mainframe-style data formats (e.g. EBCDIC, Column Binary). Modeled after SPSS Quantum.

GPL

CSPRO Does form-based data entry, crosstabulation, and mapping. From the U.S. Census.

GPL

DataCleaner Tools for data review and editing. OSS

HDF Hierarchical Data Format -- a portable format for representing and manipulating large scientific datasets. The latest version is compatible with netcdf. Also see the netcdf packages for R.

IVEware Multiple imputation for missing data OSS

MySql One of the most mature and stable open source SQL databases.

GPL

netCDF A portable format for representing and manipulating large scientific datasets. Also see the netcdf package in R, the NCO package for manipulating netcdf data on the command line, and the Parallel-NetCDF package for high-speed access to NetCDF data.

GPL

PostGRES One of the most mature and stable open source SQL databases.

GPL

R DBI Connects R and SQL databases. GPL

Matrix Algebra, Symbolic Algebra, and Computational Algebra Systems

These are standalone systems. For related programmer's libraries see my Resources for Numerical Accuracy listing. The following feature comparison contrasts these and a dozen other more specialized packages.

Axiom Computer algebra. Lots of functions. Good documentation

GPL

Giac/Xcas A computer algebra system. Includes limited compatibility with Maple, MuPad and TI89 syntax; arbitrary precision.

GPL

Ginac A computer algebra system. (C++ Library) GPL

FreeMat Matrix algebra system. Matlab compatibility and built-in parallelization.

GPL

GAP Computer algebra system for group theory. Computational discrete algebra.

OSS

JACAL A computer algebra system. GPL

Magnus Computer algebra system for group theory. GPL

matrex A 'spreadsheet' where each cell is a matrix. Provides graphing, presentations, multi-threaded function-based calculations

GPL

Mathomatic Yet another computer algebra system GPL

Maxima A computer algebra system. GPL

OCTAVE A matrix manipulation/mathematics environment like Matlab. Mature.

GPL

PARI/GP A computer algebra system with arbitrary precision arithmetic, like Maple or Mathematica.

GPL

RLAB A matrix manipulation environment. GPL

SAGE General purpose mathematical computing environment GPL

SciLab A matrix manipulation/mathematics environment like Matlab. Mature.

GPL

Tela Tensor computing GPL

YACAS Yet another computer algebra system (eponymous). Comes with Euler, for numerical programming.

GPL

Yorick An older matrix language. OSS

Social Network Analysis

Also see the Spatial category above for software with complementary and overlapping spatial network and display features.

Bibexcel Bibliometric citation analysis. No Source

CiteSpace Visualizes networks over time. No Source

Cfinder Uses the clique percolation method to find overlapping dense groups of nodes.

No Source

Egonet Collection and analysis of egocentric network data. No Source

GraphViz Mathematical graph visualization OSS

Insoshi A social network platform -- useful for data collection. GPL

Nettvis Analyze and visualize social networks. Includes an on-line service.

GPL

NetworkX Python toolkit for visualization and analysis OSS

NWB Network Workbench, visualization and descriptives. OSS

Pajek Graph clustering, partitioning, citation analysis, network comparison (differences, unions), metrics.

No Source

Proximity Visualization and knowledge discovery from heterogeneous relational networks.

OSS

R Modules for Network Analysis

A number of R modules maintained by Carter Butts, including SNA, network, nettheory, metamatrix. Also see Statnet for more R network packages.

OSS

Sitkis Computes common bibliometric network statistics. No Source

SocNetV Provides core graph measures for social network analysis

GPL

Sonia Animated visualizations of longitudinal social networks GPL

STOCNET Analysis of some interesting models, including evolution of social networks, blockmodeling, dyadic variable and actor analysis, maximum likelihood analysis of longitudinal (evolution of) networks (through SIENA), and core network analysis.

GPL

Tulip Visualization for extremely large graphs. Plugins are available for clustering and core graph metrics.

GPL

VISONE Provides core graph measures for social network analysis

No source

WinMine Bayesian and dependency (decision-tree) network builder

No source

Differential Equations and Dynamic Simulation

A good list of dynamic simulation packages is maintained by the SIAM activity group on dynamic systems.

PETSc Scientific toolkit for differential equations. No Source

scirun A scientific environment for simulation and PDE's. No source

SUNDIALS Nonlinear and differential/algebraic equation Solver OSS

Machine Learning

A good list of machine learning tools is at mloss.org. Also see the categories on text mining and data mining.

dysii C++ library for probabilistic learning within dynamic systems; high performance.

GPL

Open source software, since it is inherently extensible, offers unparalleled opportunities to the researcher to do cutting edge research. Because it is free, it offers opportunities to the student or practitioner on a limited budget. This list concentrates on statistical packages that offer high-level statistical functions and that make source code freely available. Free software that is not open source is included only when it offers significant functionality that is not otherwise available. A number of software companies offer academic discounts, limited trials or other closed but usable software. See below for other lists that include commercial software.

Analyzing Data

There are some web-based statistics tutorials out there, but none that I like. I recommend some readings:

Introductory: Problem Solving, Chris Chatfield, Chapman & Hall, 1995. An excellent introduction to basic data analysis, from simple descriptive statistics through basic anova. This is a beginner's guide that emphasizes understanding data. Half the fun is in doing the exercises.

Visualization: Visualizing Data and Elements of Graphing Data, William S. Cleveland. Not as beautiful as Tufte, but a more systematic approach to the visual analysis of data. The Visual Display of Quantitative Information, Edward Tufte. A classic, and a beautiful book. You may also wish to read his later books Envisioning Information (1990) and Visual Explanations (1997). If you are interested in presenting information with maps, you may be interested in two books by Mark Monmonier: How to Lie With Maps, and Mapping It Out.

Econometrics: Foundations of Econometrics, by Mittelhammer, Judge and Miller, is wide-ranging, and relatively gentle. Econometric Analysis, by William H. Greene, is voluminous and comprehensive. A Guide to Econometrics, by Peter Kennedy, shows everything that can go wrong with a regression and what to do about it. Unifying Political Methodology: The Likelihood Theory of Statistical Inference, by Gary King, is invaluable for the political science graduate student, although not up to date with advanced methods. It offers a consistent framework for numerous models used in political science.

Spatial Statistics: Spatial Data Analysis by Robert Haining and Spatial Statistics, by Brian Ripley, are great references.

Time Series: The Analysis of Time Series: An Introduction, by Chris Chatfield, is known by legions of students. Time Series Analysis, by James Hamilton, is more comprehensive. Event History Modeling, by Box-Steffensmeier and Jones is recommended for any social scientist using this technique.

Bayesian Methods: Bayesian Data Analysis by Gelman, Carlin, Stern and Rubin, is the textbook for Bayesian methods. Social scientists should read Bayesian Methods: A social and Behavioral Sciences Approach, by Jeff Gill.

Statistical Computation: Numerical Issues in Statistical Computing for the Social Scientist is our book on the subject, and we think it's great for any social scientist who needs a practical introduction. Our resources page lists many others.

Statistical Humor: Not an oxymoron; see The Gallery of Statistics Jokes.

Other Lists of Statistical Software Packages

o Econometrics Journal links to software of interest to economists. Mostly commercial, but some free software is included.

o John C. Pezzullo's list of software -- lists some minor packages not listed here because their functionality is already included in other major packages, and lists commercial packages

o Free Software by STATCON

o Free Statistical Software list by Andrea Corsini

o Gene Shackman's Sociological Research Methods Page

o Mailing lists, this list of discussion lists is a good place to start when you have questions about stat packages.

o MAS Scientific software links

o Stata Corporation maintains a list of other software packages, mostly commercial

o Statlib source of lots of statistical programs in SPLUS and other stat languages.

o York University's Statistical Resources Page

Caveats

"Entia non sunt multiplicanda sine necessitate" - William of Ockham's rule

"Ad indicia spectate." - Micah's corollary

"Doing econometrics is like trying to learn the laws of electricity by playing the radio." - Orcutt's observation

"One problem with political science is that its laboratories are unsecured, allowing real people to roam around inside them, spitting in test tubes and fiddling with computers" - Walter Kirn

"You can see a lot, just by looking." - Yogi Berra


Copyright © 1995-2009 Micah Altman

