Understanding the Text-Fabric Data format

tl;dr

The code on GitHub
The SQLite database (almost 500 MB)

The corpus

The BHSA corpus is the sort of thing that brings tears to your eyes: a freely available morphologically annotated text of the Hebrew Bible, made available for free. I don't know the faith commitments of the people behind it, but they certainly qualify as linguistic saints. :-)

To get a feel for how good this corpus is, you can browse the SHEBANQ web site, or the Parabible web site; both are derived from the same corpus.

Two things motivate this web page. The documentation shifts around and is an uneven state, so it is a little hard to get started with the data from that perspective. Second, the corpus is created using software called TextFabric, which is written in Python. I don't know Python, and altough I occasionally think that I should take the plunge, I think on the whole it'd be better for me to invest more time in C++. So my goal is to get the data into a SQLite database.

Why read from the files

The software I wrote here reads directly from the TextFabric files. Of course the original Python also reads the data files and puts them into sensible data structures. The Python API seems to change from time to time (and this has invalidated at least one previous attempt to put the data into SQLite format), whereas the file formats seem to be more stable. So I thought it best to read from the file formats.

Getting the corpus

I found it was fairly easy to install TextFabric, following the instructions on their web site. The only real complication I faced was that initially I downloaded the 32-bit version of Python rather than the 64-bit version; TextFabric requires the 64-bit version. Once you follow their instructions, you can open up a browser for the data with this command:


text-fabric bhsa

The installation downloads the corpus for you automatically. For me, it installed the data here:


C:\Users\Adam\text-fabric-data\

There are several corpora available, but as we're only talking about the BHSA corpus, the data we want are here:


C:\Users\Adam\text-fabric-data\etcbc\bhsa\tf\c

All of the files have a .tf extension. The data is in plain text format in those files.

The key file

The file otype.tf is a very important one. It contains the following data. The labels are refer to types of data. word, chapter, clause, etc., are all different object types.


1-426584	word
426585-426623	book
426624-427552	chapter
427553-515673	clause
515674-606361	clause_atom
606362-651541	half_verse
651542-904748	phrase
904749-1172289	phrase_atom
1172290-1236016	sentence
1236017-1300541	sentence_atom
1300542-1414353	subphrase
1414354-1437566	verse
1437567-1446799	lex

The numbers are node indices. You know a node's type by looking at its index. Nodes 1, 2, 3,...426584 have the type word. Nodes 426585, 426586, 426587,...426623 have the type book.

The other files

The data for each object time is in the other .tf files. The format of a .tf is pretty nicely documented. For our purposes, there are a few things to note. Each file is identified on the first line as one of three types.

@node files are the majority. This is where the information is.
@edge files represent connections between nodes. There are a few of these. (In the BHSA corpus, it happens that edges are not labeled.)
@config files just have metadata.

Each of these files represents what TextFabric calls a feature. Something that was non-intuitive to me was that different objects can have the same features. For instance, the file book.tf has features for both the book and chapter object types.

The features are all documented on the BHSA web site (currently, under “Features” at the bottom of the menu on the left).

As a SQLite database

I have put my code in a GitHub repository. (It's written in C++ with the Qt framework.) In brief:

Every data type listed in otype.tf has its own table.
Every @node-type is a column in the appropriate table.
Every @edge-type file also gets its own table.
@config-type files are ignored.

In each node table, the column _id is unique, and corresponds to the node index.

I am still working on understanding all of the features.

Finding verses

The other important file is oslots.tf, which is represented in the database with the oslots table. This table contains all of the tree-like membership data.

Example: I consult the verse table and see that Genesis 1:1 has an _id of 1414354. I query the oslots table to find the constituents of the verse:


SELECT * FROM oslots WHERE from_node='1414354' ORDER BY to_node ASC;

Result:
from_node	to_node
1414354	1
1414354	2
1414354	3
1414354	4
1414354	5
1414354	6
1414354	7
1414354	8
1414354	9
1414354	10
1414354	11

We just have to know that when it's a verse in the from_node, it will be a word in the to_node. To make a more helpful query:


# lots of data
SELECT * FROM oslots LEFT JOIN word ON oslots.to_node=word._id WHERE from_node='1414354' ORDER BY to_node ASC;
# just the highlights
SELECT word._id, g_lex_utf8,gloss,pdp FROM oslots LEFT JOIN word ON oslots.to_node=word._id WHERE from_node='1414354' ORDER BY to_node ASC;

_id	g_lex_utf8	gloss	pdp
1	בְּ	in	prep
2	רֵאשִׁית	beginning	subs
3	בָּרָא	create	verb
4	אֱלֹה	god(s)	subs
5	אֵת	<object marker>	prep
6	הַ	the	art
7	שָּׁמַי	heavens	subs
8	וְ	and	conj
9	אֵת	<object marker>	prep
10	הָ	the	art
11	אָרֶץ	earth	subs

Locating a specific verse requires a bigger query:


SELECT * FROM verse 
			LEFT JOIN oslots 
			LEFT JOIN word 
		WHERE verse._id=oslots.from_node
			AND oslots.to_node=word._id 
			AND book='Genesis' AND chapter='1' AND verse='1' ORDER BY to_node ASC;


SELECT * FROM verse 
			LEFT JOIN oslots 
			LEFT JOIN word 
		WHERE verse._id=oslots.from_node
			AND oslots.to_node=word._id 
			AND OSIS='Ps.143.8' ORDER BY to_node ASC;


SELECT group_concat(g_word_utf8,' ') AS verseText FROM 
		(SELECT g_word_utf8 FROM verse 
						LEFT JOIN oslots 
						LEFT JOIN word 
					WHERE verse._id=oslots.from_node
						AND oslots.to_node=word._id 
						AND OSIS='Ps.142.8' 
					ORDER BY number ASC);


SELECT group_concat(g_word_utf8,' ') AS verseText FROM 
								(SELECT g_word_utf8 FROM word 
											WHERE OSIS=?  
											ORDER BY number ASC);

Note that the book column in the verses table has the Latin names.