tl;dr
The BHSA corpus is the sort of thing that brings tears to your eyes: a freely available morphologically annotated text of the Hebrew Bible, made available for free. I don't know the faith commitments of the people behind it, but they certainly qualify as linguistic saints. :-)
To get a feel for how good this corpus is, you can browse the SHEBANQ web site, or the Parabible web site; both are derived from the same corpus.
Two things motivate this web page. The documentation shifts around and is an uneven state, so it is a little hard to get started with the data from that perspective. Second, the corpus is created using software called TextFabric, which is written in Python. I don't know Python, and altough I occasionally think that I should take the plunge, I think on the whole it'd be better for me to invest more time in C++. So my goal is to get the data into a SQLite database.
The software I wrote here reads directly from the TextFabric files. Of course the original Python also reads the data files and puts them into sensible data structures. The Python API seems to change from time to time (and this has invalidated at least one previous attempt to put the data into SQLite format), whereas the file formats seem to be more stable. So I thought it best to read from the file formats.
I found it was fairly easy to install TextFabric, following the instructions on their web site. The only real complication I faced was that initially I downloaded the 32-bit version of Python rather than the 64-bit version; TextFabric requires the 64-bit version. Once you follow their instructions, you can open up a browser for the data with this command:
text-fabric bhsa
The installation downloads the corpus for you automatically. For me, it installed the data here:
C:\Users\Adam\text-fabric-data\
There are several corpora available, but as we're only talking about the BHSA corpus, the data we want are here:
C:\Users\Adam\text-fabric-data\etcbc\bhsa\tf\c
All of the files have a .tf
extension. The data is in plain text format in those files.
The file otype.tf
is a very important one. It contains the following data. The labels are refer to types of data. word
, chapter
, clause
, etc., are all different object types.
1-426584 word
426585-426623 book
426624-427552 chapter
427553-515673 clause
515674-606361 clause_atom
606362-651541 half_verse
651542-904748 phrase
904749-1172289 phrase_atom
1172290-1236016 sentence
1236017-1300541 sentence_atom
1300542-1414353 subphrase
1414354-1437566 verse
1437567-1446799 lex
The numbers are node indices. You know a node's type by looking at its index. Nodes 1, 2, 3,...426584 have the type word
. Nodes 426585, 426586, 426587,...426623 have the type book
.
The data for each object time is in the other .tf
files. The format of a .tf
is pretty nicely documented. For our purposes, there are a few things to note. Each file is identified on the first line as one of three types.
@node
files are the majority. This is where the information is.@edge
files represent connections between nodes. There are a few of these. (In the BHSA corpus, it happens that edges are not labeled.)@config
files just have metadata.Each of these files represents what TextFabric calls a feature. Something that was non-intuitive to me was that different objects can have the same features. For instance, the file book.tf
has features for both the book
and chapter
object types.
The features are all documented on the BHSA web site (currently, under “Features” at the bottom of the menu on the left).
I have put my code in a GitHub repository. (It's written in C++ with the Qt framework.) In brief:
otype.tf
has its own table.@node
-type is a column in the appropriate table.@edge
-type file also gets its own table.@config
-type files are ignored.In each node table, the column _id
is unique, and corresponds to the node index.
I am still working on understanding all of the features.
The other important file is oslots.tf
, which is represented in the database with the oslots
table. This table contains all of the tree-like membership data.
Example: I consult the verse
table and see that Genesis 1:1 has an _id
of 1414354. I query the oslots
table to find the constituents of the verse:
SELECT * FROM oslots WHERE from_node='1414354' ORDER BY to_node ASC;
Result:
from_node to_node
1414354 1
1414354 2
1414354 3
1414354 4
1414354 5
1414354 6
1414354 7
1414354 8
1414354 9
1414354 10
1414354 11
We just have to know that when it's a verse
in the from_node
, it will be a word
in the to_node
. To make a more helpful query:
# lots of data
SELECT * FROM oslots LEFT JOIN word ON oslots.to_node=word._id WHERE from_node='1414354' ORDER BY to_node ASC;
# just the highlights
SELECT word._id, g_lex_utf8,gloss,pdp FROM oslots LEFT JOIN word ON oslots.to_node=word._id WHERE from_node='1414354' ORDER BY to_node ASC;
_id g_lex_utf8 gloss pdp
1 בְּ in prep
2 רֵאשִׁית beginning subs
3 בָּרָא create verb
4 אֱלֹה god(s) subs
5 אֵת <object marker> prep
6 הַ the art
7 שָּׁמַי heavens subs
8 וְ and conj
9 אֵת <object marker> prep
10 הָ the art
11 אָרֶץ earth subs
Locating a specific verse requires a bigger query:
SELECT * FROM verse
LEFT JOIN oslots
LEFT JOIN word
WHERE verse._id=oslots.from_node
AND oslots.to_node=word._id
AND book='Genesis' AND chapter='1' AND verse='1' ORDER BY to_node ASC;
SELECT * FROM verse
LEFT JOIN oslots
LEFT JOIN word
WHERE verse._id=oslots.from_node
AND oslots.to_node=word._id
AND OSIS='Ps.143.8' ORDER BY to_node ASC;
SELECT group_concat(g_word_utf8,' ') AS verseText FROM
(SELECT g_word_utf8 FROM verse
LEFT JOIN oslots
LEFT JOIN word
WHERE verse._id=oslots.from_node
AND oslots.to_node=word._id
AND OSIS='Ps.142.8'
ORDER BY number ASC);
SELECT group_concat(g_word_utf8,' ') AS verseText FROM
(SELECT g_word_utf8 FROM word
WHERE OSIS=?
ORDER BY number ASC);
Note that the book column in the verses
table has the Latin names.
All contents © 2024 Adam Baker, except where otherwise noted.