tl;dr

The corpus

The BHSA corpus is the sort of thing that brings tears to your eyes: a freely available morphologically annotated text of the Hebrew Bible, made available for free. I don't know the faith commitments of the people behind it, but they certainly qualify as linguistic saints. :-)

To get a feel for how good this corpus is, you can browse the SHEBANQ web site, or the Parabible web site; both are derived from the same corpus.

Two things motivate this web page. The documentation shifts around and is an uneven state, so it is a little hard to get started with the data from that perspective. Second, the corpus is created using software called TextFabric, which is written in Python. I don't know Python, and altough I occasionally think that I should take the plunge, I think on the whole it'd be better for me to invest more time in C++. So my goal is to get the data into a SQLite database.

Why read from the files

The software I wrote here reads directly from the TextFabric files. Of course the original Python also reads the data files and puts them into sensible data structures. The Python API seems to change from time to time (and this has invalidated at least one previous attempt to put the data into SQLite format), whereas the file formats seem to be more stable. So I thought it best to read from the file formats.

Getting the corpus

I found it was fairly easy to install TextFabric, following the instructions on their web site. The only real complication I faced was that initially I downloaded the 32-bit version of Python rather than the 64-bit version; TextFabric requires the 64-bit version. Once you follow their instructions, you can open up a browser for the data with this command:

text-fabric bhsa

The installation downloads the corpus for you automatically. For me, it installed the data here:

C:\Users\Adam\text-fabric-data\

There are several corpora available, but as we're only talking about the BHSA corpus, the data we want are here:

C:\Users\Adam\text-fabric-data\etcbc\bhsa\tf\c

All of the files have a .tf extension. The data is in plain text format in those files.

The key file

The file otype.tf is a very important one. It contains the following data. The labels are refer to types of data. word, chapter, clause, etc., are all different object types.

1-426584 word 426585-426623 book 426624-427552 chapter 427553-515673 clause 515674-606361 clause_atom 606362-651541 half_verse 651542-904748 phrase 904749-1172289 phrase_atom 1172290-1236016 sentence 1236017-1300541 sentence_atom 1300542-1414353 subphrase 1414354-1437566 verse 1437567-1446799 lex

The numbers are node indices. You know a node's type by looking at its index. Nodes 1, 2, 3,...426584 have the type word. Nodes 426585, 426586, 426587,...426623 have the type book.

The other files

The data for each object time is in the other .tf files. The format of a .tf is pretty nicely documented. For our purposes, there are a few things to note. Each file is identified on the first line as one of three types.

Each of these files represents what TextFabric calls a feature. Something that was non-intuitive to me was that different objects can have the same features. For instance, the file book.tf has features for both the book and chapter object types.

The features are all documented on the BHSA web site (currently, under “Features” at the bottom of the menu on the left).

As a SQLite database

I have put my code in a GitHub repository. (It's written in C++ with the Qt framework.) In brief:

In each node table, the column _id is unique, and corresponds to the node index.

I am still working on understanding all of the features.

Finding verses

The other important file is oslots.tf, which is represented in the database with the oslots table. This table contains all of the tree-like membership data.

Example: I consult the verse table and see that Genesis 1:1 has an _id of 1414354. I query the oslots table to find the constituents of the verse:

SELECT * FROM oslots WHERE from_node='1414354' ORDER BY to_node ASC; Result: from_node to_node 1414354 1 1414354 2 1414354 3 1414354 4 1414354 5 1414354 6 1414354 7 1414354 8 1414354 9 1414354 10 1414354 11

We just have to know that when it's a verse in the from_node, it will be a word in the to_node. To make a more helpful query:

# lots of data SELECT * FROM oslots LEFT JOIN word ON oslots.to_node=word._id WHERE from_node='1414354' ORDER BY to_node ASC; # just the highlights SELECT word._id, g_lex_utf8,gloss,pdp FROM oslots LEFT JOIN word ON oslots.to_node=word._id WHERE from_node='1414354' ORDER BY to_node ASC; _id g_lex_utf8 gloss pdp 1 בְּ in prep 2 רֵאשִׁית beginning subs 3 בָּרָא create verb 4 אֱלֹה god(s) subs 5 אֵת <object marker> prep 6 הַ the art 7 שָּׁמַי heavens subs 8 וְ and conj 9 אֵת <object marker> prep 10 הָ the art 11 אָרֶץ earth subs

Locating a specific verse requires a bigger query:

SELECT * FROM verse LEFT JOIN oslots LEFT JOIN word WHERE verse._id=oslots.from_node AND oslots.to_node=word._id AND book='Genesis' AND chapter='1' AND verse='1' ORDER BY to_node ASC; SELECT * FROM verse LEFT JOIN oslots LEFT JOIN word WHERE verse._id=oslots.from_node AND oslots.to_node=word._id AND OSIS='Ps.143.8' ORDER BY to_node ASC; SELECT group_concat(g_word_utf8,' ') AS verseText FROM (SELECT g_word_utf8 FROM verse LEFT JOIN oslots LEFT JOIN word WHERE verse._id=oslots.from_node AND oslots.to_node=word._id AND OSIS='Ps.142.8' ORDER BY number ASC); SELECT group_concat(g_word_utf8,' ') AS verseText FROM (SELECT g_word_utf8 FROM word WHERE OSIS=? ORDER BY number ASC);

Note that the book column in the verses table has the Latin names.

All contents © 2023 Adam Baker, except where otherwise noted.