MiniCorpJS 0.5

MiniCorpJS
Version 0.5 - Last updated on January 22, 2008

What is MiniCorp?

MiniCorpJS is a tool for small-scale corpus analysis. It aims to make the creation, exploration, and quantitative analysis of corpora, particularly dictionaries of phonological data, as simple and powerful as possible. Unlike other software for phonological analysis within the framework of Optimality Theory (OT) (for examples, see HERE), MiniCorp is not an automatic grammar learner. Instead, it formalizes the intuitive procedure that phonologists use when testing grammatical hypotheses against dictionary data.

Simplicity. If you have already prepared an electronic list of phonological forms, you can go from hunch to statistically supported generalization in a matter of minutes. Special tools help speed up the process of tagging the corpus for analysis.

Power. MiniCorpJS applies techniques developed by corpus linguists to phonological data, using quantitative techniques to explore patterns latent in the data and to test specific hypotheses about them within an Optimality-Theoretic (OT) framework.

Software requirements. MiniCorpJS itself is just the page you're looking at right now: an HTML file containing JavaScript. MiniCorpJS has been most extensively tested in Firefox 2.0 running on (Traditional Chinese) Windows XP Professional 5.1, and it also seems to work OK in Internet Explorer 6 for Windows. The JavaScript should run OK in any post-1997 browser that has JavaScript enabled, but reports about incompatibilities or other bugs, as well as suggestions for new features, are most welcome!

To analyze your data with MiniCorpJS, you will also need to install R, the free yet powerful statistics software package that is fast becoming a world standard (www.r-project.org). R is the one free statistics program you might actually want to own. To learn more about R, including how to download it, see HERE.

Further basics. If this is your first time trying MiniCorpJS, you might want to scroll through the whole thing before starting a project, just to get a sense of what you're in for. MiniCorpJS is also printer-friendly (it's currently about 8 pages long).

Using MiniCorpJS involves a series of guided or fully automated steps, divided into three major parts: tagging, exploring (optional), and hypothesis testing. It is safest to reserve enough time for each major part to finish it in one session, since the current version of MiniCorpJS cannot save your work as a whole. Similarly, don't reload the page while you're working, or some information may be lost. You can copy your work from the screen at any time and paste it into a file as you work.

To use MiniCorpJS offline, just save this page to your computer (along with this information page, if you like). Opening offline files with JavaScript may make some newer versions of IE nervous, but rest assured that MiniCorpJS doesn't read or write anything on your hard drive (it handles input and output by asking you to paste to and copy from text windows). If you encounter security warnings using MiniCorpJS offline in IE, either tell IE to allow blocked content, or better yet, dump IE and get a proper browser.

MiniCorpJS is "copylefted" with the GNU General Public License. Basically, this means you can distribute it freely (e.g. create mirrors) and modify it any way you like, including incorporating its code in your own program to sell, but the contribution of MiniCorpJS must be acknowledged and stay open-source in your new program. See the page source for more information.

MiniCorpJS is far from perfect. If something weird happens when you use it, or if you have any suggestions about how to improve it, contact the author HERE, giving details on your operating system, browser, and what you were doing when it happened. MiniCorpJS will be upgraded eventually. If you're impatient, feel free to upgrade it yourself!

If you use MiniCorpJS in your research, please cite it. MiniCorpJS is also useful as a classroom teaching aid.

For more information about minimalist corpus phonology, MiniCorp, and R, click HERE.

Steps

Using MiniCorpJS involves following a series of steps. Scroll down (or, if you are continuing a project, click a step to jump directly to where you left off).

Prepare corpus
    Load raw corpus
    Tag corpus items
    Save tagged corpus

Explore corpus (optional)
    Classify corpus items (under construction)
    Learn OT grammar (under construction)
    Find corpus neighbors
    Compute transitional probabilities (under construction)

Test hypotheses
    Download and install R (if you haven't already)
    Define OT hypothesis
    Generate R command code
    Paste R command code into R
    R will summarize the analysis in an easy-to-read format

Prepare corpus.

A MiniCorp analysis starts with a list of items. This could be an entire electronic dictionary, or a subset defined in a systematic way (in OT terms, all items that obey some high-ranked phonological or morphological constraint). The analyses will be partly based on the transcriptions of the items, but mostly on the categories that each item represents. In corpus terminology, such categories are marked with "tags". To make this process as simple as possible, MiniCorp provides tools for automatically assigning some of the tags. The tags are not arbitrary, but represent phonological hypotheses. Specifically, they are OT constraints, and items are tagged in terms of which constraints they violate.

Load raw corpus.

Paste your corpus into the large window below. Each item should be on a separate line. If your transcriptions use a non-standard font, you may be able to format them correctly by entering the name of the font in the indicated box. NOTE: This is only guaranteed to work for Unicode fonts (e.g. Doulos SIL). For non-Unicode fonts (e.g. SIL Doulos), formatting may not work correctly in browsers that follow international browser standards (e.g. Firefox).

After the raw corpus is loaded, click APPROVE CORPUS. This initial version of MiniCorp cannot handle corpora with more than 30,000 items.
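As a rough illustration of the loading step, the pasted text can be split into one item per line and checked against the 30,000-item limit. This is a minimal sketch, not MiniCorpJS's actual code: the function name `parseRawCorpus`, the constant `MAX_ITEMS`, and the decision to trim whitespace and skip blank lines are all assumptions made here.

```javascript
// Hypothetical constant; the value comes from the limit stated in the text.
const MAX_ITEMS = 30000;

// Split pasted text into corpus items, one per line.
// Assumption: surrounding whitespace is trimmed and blank lines are ignored.
function parseRawCorpus(text) {
  const items = text
    .split(/\r\n|\r|\n/)          // accept Windows, old Mac, and Unix line breaks
    .map(line => line.trim())
    .filter(line => line.length > 0);
  if (items.length > MAX_ITEMS) {
    throw new RangeError(
      `Corpus has ${items.length} items; the limit is ${MAX_ITEMS}.`
    );
  }
  return items;
}
```

For example, `parseRawCorpus("pa.ta\r\n\r\nbi.du\n")` yields the two items `pa.ta` and `bi.du`.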

ENTER RAW CORPUS HERE
	(May not work for all fonts in all browsers.) (MiniCorp indicates acceptance of input by shading.)

Tag corpus items.

MiniCorp only analyzes the patterns in your corpus expressed in terms of tags added with your help. In MiniCorp, these tags represent the OT constraints that are violated by subsets of items in your corpus (both faithfulness and markedness constraints). The patterns of constraint violations then define the phonologically relevant aspects of the items. (For examples, see below and HERE.) In this initial version of MiniCorp, you can only use up to ten constraints for tagging.

First enter the names of the constraints in the boxes at the top of the table below, from left to right. The order of the constraints doesn't have to reflect your analysis at this point, and later you may choose to ignore some of them. NOTE: Following conventions in R, constraint names cannot contain spaces or punctuation (other than "_" or ".") and cannot start with a digit. So use constraint names like xVoice (not *Voice) or IdentVoice (not Ident(Voice)).
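The naming rule can be checked mechanically before the names ever reach R. The sketch below is illustrative only; `isValidConstraintName` is a hypothetical helper, and restricting names to ASCII letters is an assumption made here for simplicity. It also rejects a leading underscore and a leading dot followed by a digit, since R does not accept those as plain variable names either.

```javascript
// Hypothetical helper: true if `name` is usable as a plain R variable name.
// Allowed: letters, digits, "." and "_"; must start with a letter or a dot,
// and a leading dot must not be followed by a digit.
function isValidConstraintName(name) {
  return /^[A-Za-z.][A-Za-z0-9._]*$/.test(name) && !/^\.[0-9]/.test(name);
}

// Names from the text: the first of each pair is fine, the second is not.
const examples = ["xVoice", "*Voice", "IdentVoice", "Ident(Voice)"];
for (const name of examples) {
  console.log(name, isValidConstraintName(name) ? "ok" : "rejected");
}
```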

Next, for each item and each constraint, click the empty button if that item violates that constraint. This will insert a star (*). If you change your mind, click the button again and the star will disappear. (This initial version of MiniCorp assumes that each item can receive at most one star per constraint.)

To undo the most recent change (including those made with the buttons described below), click UNDO.
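One way to picture the star buttons and UNDO is a table of booleans plus a record of what each change overwrote. The following is only a sketch under assumptions: `createTagTable`, `toggle`, `setMany`, and `undo` are hypothetical names, and keeping a full history stack is a choice made here (the real page may remember only the single most recent change).

```javascript
// Hypothetical tag table: stars[item][constraint] is true when a star is set.
function createTagTable(itemCount, constraintCount) {
  const stars = Array.from({ length: itemCount }, () =>
    new Array(constraintCount).fill(false)
  );
  const history = []; // each entry restores one change (single click or batch)

  // Apply a batch of [item, constraint, newValue] changes as one undoable step.
  function apply(changes) {
    const previous = changes.map(([i, c]) => [i, c, stars[i][c]]);
    for (const [i, c, value] of changes) stars[i][c] = value;
    history.push(previous);
  }

  return {
    stars,
    // Clicking a button flips one star.
    toggle(item, constraint) {
      apply([[item, constraint, !stars[item][constraint]]]);
    },
    // A bulk operation (e.g. automatic tagging) counts as a single change.
    setMany(changes) {
      apply(changes);
    },
    // Restore the values overwritten by the most recent change.
    undo() {
      const previous = history.pop();
      if (!previous) return false;
      for (const [i, c, value] of previous) stars[i][c] = value;
      return true;
    },
  };
}
```

Recording the overwritten values, rather than the action itself, lets a single click and a bulk update be undone by the same code path.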

If you have many items, you can supplement hand-tagging in two ways:

Click the SORT button above a constraint to sort the corpus items by their tags for that constraint. This makes it easier to see if any items have been misclassified. You can click the ORIGINAL ORDER button to return to the original order.
To automatically tag items containing substrings specified by a constraint, enter a regular expression in the appropriate box at the bottom of the table, then click the appropriate MATCH button. For more on regular expressions, see HERE. NOTE: MiniCorpJS depends on JavaScript's built-in regular expression interpreter, which may not handle Unicode characters reliably.

For example, here's how to tag for some markedness constraints in a language with the segment inventory {p, t, k, b, d, g, a, i, u}:

No_tp: tp
xVoice: [bdg] ([xy] = "x or y")
AgrVoice: ([ptk][bdg])|([bdg][ptk]) (| = "or")
NoGeminates: ([ptkbdg])\1 (\1 = "the same segment that matched inside the first parentheses")
AgrV: (i.*[au])|(a.*[iu])|(u.*[ai]) (. = "any segment"; * = "any number of the previous segment")
xFinalLab: [pb]$ ($ = "end of string")
xInitialLab: ^[pb] (^ = "start of string")
Onset: (^|\.)[aiu] (\ = "treat following segment literally"; assuming syllable breaks are marked "." in corpus)
NoCoda: [ptkbdg]($|\.) (assuming syllable breaks are marked "." in corpus)
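To see how such patterns behave outside the page, they can be tried directly in JavaScript. The sketch below applies a subset of the expressions listed above to two made-up items; `markednessPatterns` and `autoTag` are hypothetical names, and the code illustrates the idea of regex-based tagging rather than reproducing what the MATCH button actually runs.

```javascript
// A subset of the patterns from the list above, keyed by constraint name.
const markednessPatterns = {
  No_tp: /tp/,
  xVoice: /[bdg]/,
  AgrVoice: /([ptk][bdg])|([bdg][ptk])/,
  AgrV: /(i.*[au])|(a.*[iu])|(u.*[ai])/,
  xFinalLab: /[pb]$/,
  xInitialLab: /^[pb]/,
  Onset: /(^|\.)[aiu]/,      // assumes "." marks syllable breaks
  NoCoda: /[ptkbdg]($|\.)/,  // assumes "." marks syllable breaks
};

// Tag each item: tags[name] is true when the item matches the pattern,
// i.e. when it would receive a star for that constraint.
function autoTag(items, patterns) {
  return items.map(item => {
    const tags = {};
    for (const [name, pattern] of Object.entries(patterns)) {
      tags[name] = pattern.test(item);
    }
    return { item, tags };
  });
}

// Made-up items for illustration only.
const tagged = autoTag(["pat.ki", "a.bu"], markednessPatterns);
console.log(tagged[0].tags.NoCoda); // true: "t" closes the first syllable
console.log(tagged[1].tags.Onset);  // true: the first syllable starts with "a"
```

None of the patterns uses the `g` flag, so `test` carries no state from one item to the next.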

Faith constraint violations are harder to mark automatically since the inputs are not expressed directly in the corpus, but MiniCorp can do half the job by marking all items that potentially violate specified faith constraints. If the process is neutralizing, you'll have to toggle off violations for the faithful items manually. For example:

IdentIOVoice: g (assuming [g] is always derived from /k/ and nothing else becomes voiced)
DepV: (^|\.)[ptkbdg][aiu](\.)[ptkbdg] (for vowels epenthesized in underlying onset clusters; marks for items with underlying vowels in this position must be toggled off manually)

When you are satisfied that all corpus items are tagged correctly, click APPROVE TAGS.



		Constraints:










		Regular expressions:

Save tagged corpus.

If all went well, the window below should now show your corpus tags (items are represented by numbers indicating the order in which they appear in your original file). Digits represent the number of violations. To prepare the corpus for analysis, copy and save it as a text file.
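The exact layout of the tagged corpus file is not spelled out here, so the following is only a guess at its general shape: a header row of constraint names, then one row per item with its number and a 0 or 1 for each constraint. The function name `serializeTags`, the `Item` column label, and the tab separator are all assumptions.

```javascript
// Hypothetical serializer: turn a table of stars into tab-separated text.
// constraintNames: array of strings; stars: array of rows of booleans.
function serializeTags(constraintNames, stars) {
  const header = ["Item", ...constraintNames].join("\t");
  const rows = stars.map((row, index) =>
    // Items are identified by their 1-based position in the original corpus.
    [index + 1, ...row.map(hasStar => (hasStar ? 1 : 0))].join("\t")
  );
  return [header, ...rows].join("\n");
}

console.log(serializeTags(["xVoice", "NoCoda"], [[true, false], [false, false]]));
```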

COPY TAGGED CORPUS FILE FROM HERE

Explore corpus.

Unlike experimental data, a corpus exists prior to making any hypothesis. Thus it is often useful to explore it in a nondirectional way, to see what sorts of patterns emerge naturally. Moreover, corpus properties can affect behavior in experiments, so it is useful to be aware of these properties when designing or analyzing experiments. This section offers tools to classify corpus items automatically, to learn OT grammars, to find neighbors of (experimental) items in the main corpus, and to compute the transitional probability between units composing (experimental) items (again based on the main corpus).

Classify corpus items.

[Under construction.]

Learn OT grammar.

[Under construction.]

Find corpus neighbors.

When giving acceptability judgments of word-sized items, speakers may be influenced by superficial analogy with neighboring (i.e. similar) lexical items, rather than (or in addition to) productive grammar. Lexical neighbors can be defined in various ways (see HERE). Most definitions start with the notion of edit distance (Levenshtein distance), which counts the minimum number of deletions, insertions, and/or replacements of units needed to change one item into another. In Optimality-Theoretic terms, edit distance can be thought of as measuring output-output correspondence violations (summing violations of Max-OO, Dep-OO, and Ident-OO).
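For readers who want to see edit distance computed, here is the standard dynamic-programming version of Levenshtein distance. It is a generic textbook sketch, not the routine MiniCorpJS uses, and it assumes that each Unicode code point counts as one unit (a transcription that uses digraphs or combining diacritics would need its own segmentation).

```javascript
// Levenshtein distance: minimum number of deletions, insertions, and
// replacements needed to turn string a into string b.
function editDistance(a, b) {
  const s = Array.from(a); // split by code point, not by UTF-16 unit
  const t = Array.from(b);
  // prev[j] = distance between the first i-1 units of s and first j units of t
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i]; // deleting all i units of s gives the empty string
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // replacement (or match)
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(editDistance("pat", "bat")); // 1 (one replacement)
console.log(editDistance("pat", "pa"));  // 1 (one deletion)
```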

This initial version of MiniCorp defines the neighbor of a target item in the simplest (and most traditional) way: a corpus item that has an edit distance of one. The neighborhood density of a target item is the number of its neighbors. Corpus items themselves differ in neighborhood density, but often it is useful to compute the neighborhood density for extra-lexical items, especially those used in judgment experiments.
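Because only a distance of exactly one matters for this definition, neighbors can be detected without computing full edit distances. The sketch below is one possible approach, not the algorithm MiniCorpJS itself uses; `isNeighbor` and `neighborhoodDensity` are hypothetical names, units are assumed to be Unicode code points, and repeated dictionary entries are each counted.

```javascript
// True if a and b differ by exactly one deletion, insertion, or replacement.
function isNeighbor(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  if (Math.abs(s.length - t.length) > 1) return false;
  const [shorter, longer] = s.length <= t.length ? [s, t] : [t, s];

  // Skip the shared prefix.
  let i = 0;
  while (i < shorter.length && shorter[i] === longer[i]) i++;

  if (shorter.length === longer.length) {
    if (i === shorter.length) return false; // identical items are not neighbors
    // One replacement at position i: everything after it must match.
    for (let k = i + 1; k < shorter.length; k++) {
      if (shorter[k] !== longer[k]) return false;
    }
    return true;
  }
  // One extra unit in the longer item at position i: the rest must line up.
  for (let k = i; k < shorter.length; k++) {
    if (shorter[k] !== longer[k + 1]) return false;
  }
  return true;
}

// Neighborhood density: number of dictionary items at edit distance one.
function neighborhoodDensity(target, dictionary) {
  return dictionary.filter(entry => isNeighbor(target, entry)).length;
}

// Made-up dictionary for illustration.
const dictionary = ["pat", "bat", "pit", "pa", "kat", "tip"];
console.log(neighborhoodDensity("pat", dictionary)); // 4: bat, pit, pa, kat
```

Each comparison takes time proportional to the item length, though every target is still compared against every dictionary item.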

Paste a complete list of dictionary items in the window on the left (this initial version of MiniCorp cannot handle dictionaries with more than 30,000 items). In the window on the right, enter the list of target items whose neighborhood densities you want to compute. This second list can be a simple list of items, or a Master List created for a MiniJudge experiment. If you want to compute neighborhood densities for the dictionary itself, paste the dictionary into both windows.

NOTE: The algorithm currently used to calculate neighborhood densities is not the fastest possible. Depending on the size of your dictionary and target item set, this process may take a little while (for small dictionaries and item sets, 60 seconds may be typical). So if your browser asks whether you want to continue running the script, say yes. Be patient; the process does end eventually!

When you are ready, click COUNT NEIGHBORS.

DICTIONARY
TARGET ITEM SET OR MINIJUDGE MASTER LIST

The results should appear in the window below, arranged in three columns (the columns may not seem to line up, but they really do): ItemID (identification numbers for each item), NeighDens (neighborhood density), and LexStat (lexical status, where 0 = nonlexical and 1 = lexical). Select and save in a file.

COPY NEIGHBORHOOD DENSITIES FROM HERE

What you do next is up to you:

Revise your MiniJudge Master List to match neighborhood densities within sets

Include neighborhood density in the analysis of a MiniJudge experiment
Skip down to hypothesis testing of the whole corpus

Compute transitional probabilities.

[Under construction.]

Test hypotheses.

In this initial version, MiniCorpJS passes on the job of statistical analysis to R. MiniCorpJS will talk to R for you, and have R give a nontechnical summary of the statistical findings.

IMPORTANT: If you are restarting your work at this point, you must first paste the tagged corpus back into the window above and reselect it.

Download and install R.

MiniCorpJS is intended to analyze categorical (count) data in terms of multiple predictor variables (the constraints). This type of analysis requires special statistical tools (Poisson regression, or exact loglinear regression) that are generally not taught in introductory statistics classes. In principle they could be implemented efficiently in JavaScript, but that's for a later version.

Since for now, MiniCorpJS passes the hard work over to R, you need to download R before you can continue. R is by far the best free statistics package available, so it's worth owning if you do any quantitative research. Click HERE for more information about R, including information about how to download it.

Define OT hypothesis.

MiniCorp is not a grammar-learner, but instead tests the statistical reliability of a specified grammatical claim. In particular, it tests whether individual constraints are obeyed (or violated) in the corpus significantly more often than chance, even when the effects of all the other constraints are factored out. It also tests whether an analysis where the constraints are ranked in a specified way describes the data better than an analysis where the constraints are unranked. Ranking here is in the classic OT sense, where A >> B >> C implies that A outranks both B and C simultaneously; B and C cannot "gang up" and override A.

Your grammatical hypothesis is specified simply by entering the names of the constraints you want to analyze, in an order consistent with your hypothesized ranking. For example, if you hypothesize the partial rankings A >> B, C >> D, you could enter them as A, B, C, D (each on a separate line). MiniCorp currently cannot test grammars with variable rankings.

The window below shows the constraints named in your tagged corpus, in the order you entered them. Reorder and/or delete constraints to fit your hypothesis. When you are ready, click OK.
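Reading the edited list back amounts to splitting it into lines and checking each name against the constraints in the tagged corpus. The following is a minimal sketch of that check; `parseRanking` is a hypothetical name, and rejecting unknown or repeated names with an error is an assumption about sensible behavior, not a description of what the page does.

```javascript
// Hypothetical parser: one constraint name per line, highest-ranked first.
// knownNames: the constraint names present in the tagged corpus.
function parseRanking(text, knownNames) {
  const ranking = text
    .split(/\r\n|\r|\n/)
    .map(line => line.trim())
    .filter(line => line !== "");
  const known = new Set(knownNames);
  const seen = new Set();
  for (const name of ranking) {
    if (!known.has(name)) throw new Error(`Unknown constraint: ${name}`);
    if (seen.has(name)) throw new Error(`Duplicate constraint: ${name}`);
    seen.add(name);
  }
  return ranking; // constraints left out of the text are simply ignored
}

// Reordering and deleting: D is dropped, C is promoted above B.
console.log(parseRanking("A\nC\nB\n", ["A", "B", "C", "D"])); // [ 'A', 'C', 'B' ]
```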

EDIT CONSTRAINT RANKING HERE

Generate R command code.

In order to run the analysis, R will need to refer to the full name of your corpus file. NOTE: R treats any filename extension (e.g. ".txt") as part of the filename. However, by default, Windows does not show the filename extension in directories. So if you're using Windows, be sure to include ".txt" at the end of your filename even if you don't see it when you display the file in a directory. (Mac users shouldn't have to worry about this.)

Enter the name of the corpus file. Then click OK to generate the R code.

Name of data file:

Paste R command code into R.

If all has gone well, by now MiniCorpJS should have generated the R command code needed to test your OT grammar on your corpus data. When you paste this code into the R program window, it will automatically test the statistical significance of each of your constraints (i.e. whether the difference between the numbers of violating and non-violating items is greater than expected by chance) and of their proposed ranking. Thus for A>>B>>C>>D, MiniCorp will test the independent claims A>>{B,C,D}, B>>{C,D}, C>>D.
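The way a ranking breaks down into claims can be sketched in a few lines: each constraint except the last is paired with the set of all constraints below it. The names `rankingClaims` and `formatClaim` are hypothetical, and the sketch only enumerates the claims; it does not perform any statistical test.

```javascript
// For a ranking [A, B, C, D], return the claims A>>{B,C,D}, B>>{C,D}, C>>D.
function rankingClaims(ranking) {
  const claims = [];
  for (let i = 0; i < ranking.length - 1; i++) {
    claims.push({ higher: ranking[i], lower: ranking.slice(i + 1) });
  }
  return claims; // n constraints give n-1 claims
}

// Write a claim in the notation used in the text.
function formatClaim({ higher, lower }) {
  return lower.length === 1
    ? `${higher}>>${lower[0]}`
    : `${higher}>>{${lower.join(",")}}`;
}

console.log(rankingClaims(["A", "B", "C", "D"]).map(formatClaim));
// [ 'A>>{B,C,D}', 'B>>{C,D}', 'C>>D' ]
```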

The analysis will be summarized in a brief, easy-to-read format. This is based on R's own detailed but technical format, which will be automatically saved in a file in the same folder as your data file.

To use the command code, start the R program on your computer. Then you must change R's directory to the location of your corpus file using R's FILE menu. Then simply copy and paste the code below into the R window. The code uses only default R functions, so it should run quite quickly.

COPY R COMMAND CODE FROM HERE

After running the above code, the statistical analysis of your data should be complete. R displays an easy-to-read summary and saves a more technical report in a file.

For individual constraints, weights can be positive or negative. A standard OT constraint marks violations, not non-violations, so you expect the weights for the constraints to be negative: more lexical items obey the constraint than violate it (in the context of all of the other constraints). Thus even if you get a significant result, a positive weight would be suspicious.
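The sign convention can be made concrete with a generic Poisson (log-linear) model. This is a textbook formulation offered for orientation, not a transcription of the R code that MiniCorpJS generates. The symbols n_j, x_ij, and β_i are introduced here for illustration, and the assumption is that corpus items are counted by violation pattern j over k constraints.

```latex
\log \mathbb{E}[n_j] \;=\; \beta_0 + \sum_{i=1}^{k} \beta_i \, x_{ij},
\qquad
x_{ij} =
\begin{cases}
1 & \text{if pattern } j \text{ violates constraint } i,\\
0 & \text{otherwise.}
\end{cases}
```

Under this formulation, exp(β_i) is the factor by which the expected count changes when constraint i is violated, with the other constraints held constant. A negative β_i therefore means that violating items are rarer than non-violating ones, which is what a well-motivated constraint predicts.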

The ranking tests involve the comparison of two types of models, one assuming the hypothesized ranking and one not. The ranking is considered to have a significant effect if including it in the model provides significantly better coverage of the data. Technically this is done by means of the constraint weights. For example, for the hypothesized ranking A>>B, the ranking model assumes different weights a & b, while the non-ranking model assumes the null hypothesis of identical weights a = b. Note that the null hypothesis may be rejected even if a < b, exactly the opposite of what the grammar claims!
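For the A>>B example, the comparison can be written as a test of nested models. A likelihood-ratio test is one standard way to compare such models; the text does not say which test the generated R code uses, so the statistic below is an illustrative assumption. Here ℓ_0 and ℓ_1 stand for the maximized log-likelihoods of the non-ranking and ranking models.

```latex
H_0:\ a = b
\qquad \text{vs.} \qquad
H_1:\ a \neq b,
\qquad
G^2 = 2\,(\ell_1 - \ell_0) \;\sim\; \chi^2_{1}
\quad \text{(approximately, under } H_0\text{)}
```

Because the alternative is two-sided, a significant result says only that the two weights differ. The direction of the difference has to be read off the estimated weights themselves.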

As explained above, MiniCorp tests the ranking of each constraint relative to all (hypothetically) lower constraints as a set, rather than testing all of the pairwise rankings implied by transitivity. For example, A>>B>>C is broken into the claims A>>{B,C} and B>>C; the claims A>>B and A>>C are not tested. There are three reasons for this limitation:

The claims A>>{B,C} and B>>C are independent; the probability of one being true is independent of the probability of the other being true. Thus both claims can be tested together without the risk of misleading claims of significance. This is not true for the four claims as a whole, since A>>{B,C} implies A>>B and A>>C.
The number of pairwise rankings increases quadratically with the number of constraints (for n constraints, there are n(n-1)/2 pairwise rankings), which is a confusingly large number of overlapping claims to interpret. By contrast, MiniCorp runs only n-1 ranking tests for n constraints.
Most importantly, strict constraint ranking in OT means that A>>{B,C} is the relevant claim about constraint A in the context of the grammar as a whole. Thus if A>>B and A>>C are true but A>>{B,C} is false, the grammar as a whole is not supported, since these results would imply that constraints B and C really do "gang up" to override A.
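To put numbers on the second point, the two counts can be compared directly. The worked case below uses n = 10, the maximum number of constraints this version allows for tagging.

```latex
\underbrace{\frac{n(n-1)}{2}}_{\text{pairwise rankings}}
\quad \text{vs.} \quad
\underbrace{n-1}_{\text{MiniCorp ranking tests}},
\qquad
n = 10:\quad \frac{10 \cdot 9}{2} = 45 \quad \text{vs.} \quad 9 .
```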

For help understanding your statistical results, click HERE.

Version information. MiniCorpJS 0.1 is my first attempt at a JavaScript implementation of minimalist corpus phonology.

MiniCorpJS 0.5 differs from version 0.1 in the following ways:

Fixed IE bug in handling of constraint names, added web page hit counter.
Changed sort algorithm for tag table, greatly increasing speed.
Restructured Explore Corpus (red) section, added calculation of neighborhood density.

Citing MiniCorpJS. Here's how to do it in APA style:

Myers, J. (2008). MiniCorpJS (Version 0.5) [Computer software]. Retrieved from http://www.ccunix.ccu.edu.tw/~lngproc/MiniCorpJS.htm

There's also the following conference paper (with more forthcoming):

Myers, J. (2007, December). Bridging the gap: MiniCorp analyses of Mandarin phonotactics. Poster presented at the Western Conference on Linguistics 2007, San Diego, USA.

Contact James Myers with your questions and comments.