      program subbin
c
c Copyright 2012 Leon van Dommelen. This software is made available
c under the GNU Public License version 3.
c
c Perform string substitutions in a file using file binary access.
c
c Usage: subbin substitutions-file old-file new-file [-{n|q|v|V}]
c        -n is normal, -q fairly quiet, -v verbose, -V more verbose
c
c The text read from old-file is processed according to the contents
c of substitutions-file and saved as new-file. However, if there
c are no changes to the file, no new file is produced. (In theory,
c a new file may be produced even if it is unchanged, as changes may
c cancel each other.) But in any case, if the final file would be
c empty, no file will be produced.
c
c The substitutions-file must be of the form:
c     original_1
c     translation_1
c     original_2
c     translation_2
c     ...
c Empty lines preceding the originals are ignored.
c
c For each original/translation pair, the entire text is scanned and
c every occurrence of the original is replaced by the translation.
c
c Recall that lines are terminated by either linefeed (Unix) or by
c carriage-return/linefeed (DOS). To avoid ambiguity, a
c carriage-return character in substitutions-file anywhere else than
c before a linefeed is an error. And before a linefeed it is
c ignored.
c
c Backslashes in either original or translation are special; they
c produce special characters or start commands, as described below.
c To get a plain backslash, double it as \\.
c
c To translate an arbitrary binary string into another one, in both
c original and translation replace all carriage-returns with \M, all
c linefeeds with \J and all backslashes with \\. Then terminate
c each string with a newline (i.e. a linefeed character optionally
c preceded by a carriage-return one).
c
c Watch for invisible blanks at the end of lines or at end of the
c file. For example, a substitutions-file containing a single blank
c space looks empty, but it will remove every blank space from
c old-file.
c Use the cursor keys to check for the presence of such blanks. Or
c select the entire substitutions-file in your editor; usually this
c highlights the text, allowing trailing blanks to be seen.
c Alternatively, end each line with the comment command \%, which
c is ignored (along with any further characters on the line).
c
c The file should be in US-ASCII form. Where your UTF-8 editor
c shows you a single multinational character, subbin is going to see
c several separate bytes. It is going to search for each of these
c bytes separately. In practical terms, UTF-8 characters will be OK
c at the start of an original. However, behind a \?, \!, or \~, they
c are not going to produce the results you want.
c
c Exit code is 1 on a fatal error, otherwise zero.
c
c This program keeps the entire text being changed in memory, so is
c very fast. However, it can run out of memory: g77 does not allow
c dynamic memory allocation. See subbn0.f for how to increase
c the storage, and some other implementation details.
c
c For greater speed, the program uses a simple "keep on trucking"
c search algorithm. In it, each character specification in the
c original with a variable count (i.e. preceded by \? or \!
c described below) is pursued until it fails. In other words,
c unlike in regular expressions, the algorithm will *not* try to
c stop earlier to make the *next* character specification work.
c This makes the algorithm much faster. I also believe it makes the
c results much more predictable. A regular expression search is
c likely to try possibilities that you did not think of.
c
c However, the algorithm does restrict you pretty much to
c single-character searches behind a specification with unspecified
c count. To get around that limitation, use encoding. For example,
c to find \begin{verbatim}...\end{verbatim} strings, encode
c \end{verbatim} as, say, Ctrl-A (i.e. \A). Then
c \\begin{verbatim}\?\~\A\A will find the complete string.
c For more complex encoding, consider the \e translation command
c below. For checking that there is no pre-existing \A, see \!
c below.
c
c There is also another powerful trick: selection. You can search
c for a string and then select that string using \s described below.
c Any further substitutions are then restricted to the selection.
c That means that you no longer need to worry about the rest of the
c text. Do not forget to deselect with \d when done.
c
c You can further do a limited amount of iteration (looping) with
c the \b command or with the \r and \x commands.
c
c A final trick allows you to modify your substitutions based on
c what is in the text you are changing. In particular you can store
c a found string of text in a memory string. That memory string can
c then be used as part of subsequent originals and translations.
c
c As of version 2, you can now copy substrings of the found original
c into the translation like you can do with regular expressions.
c Use \{ and \} to mark the substrings, and \#N to insert them into
c the translation. N must be a digit from 1 to 9. This allows you
c to do a host of conversions and even hyphenation.
c
c This program must be compiled using g77 through the makefile.
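c
c As a hedged illustration of \{, \}, and \#N (a sketch based only on
c the descriptions in this header, not a tested example), the
c substitutions-file
c     \{\l\?\l\}=\{\l\?\l\}\%
c     \#2=\#1\%
c is meant to turn a lower case pair like key=value into value=key.
c Here \l\?\l stands for one or more lower case letters, and the
c trailing \% only guards against invisible trailing blanks. It is
c an assumption that matches are taken from left to right and that
c replaced text is not scanned again.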
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
c                         COMMAND REFERENCE
ccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
c
c GENERAL COMMANDS:
c
c   Commands describing special characters:
c
c     \\ is \
c     \  is ASCII nul (there must be a space behind the backslash)
c     \A through \Z are ASCII 1 to 26 (Ctrl-A through Ctrl-Z)
c     \7 through \1 are ASCII 27 to 31 (ESC, FS, GS, RS, US)
c     \2 is ASCII 127, the delete character DEL
c     \3 adds 128 to the ASCII value of the next character
c
c   Comments:
c
c     \% ignores the rest of the line
c
c COMMANDS FOR THE ORIGINAL ONLY:
c
c   Commands selecting characters:
c
c     command to test for begin or end of file:
c       \[ is begin-of-file (BOF)
c       \] is end-of-file (EOF)
c       note: substitutions before BOF and after EOF are voided.
c
c     commands governing case sensitivity:
c       \- turns off case sensitivity
c       \= turns on case sensitivity (default at the start of each
c          original)
c       \` obey the case of the original:
c          Letters in the translation follow the case of the
c          corresponding letter in the original. Note that letters
c          in the original are numbered from \` but in the
c          translation from its first letter. Repeating \` suspends
c          numbering of the letters of the original. Turning on
c          case obeying also turns off case sensitivity, but it can
c          be turned on again. If the translation has more letters
c          than there are numbered letters in the original, for the
c          surplus the case of the last numbered letter in the
c          original is used. \#N substitutions are not affected by
c          this.
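c
c          A possible use of \` (an untested sketch that assumes
c          only the behavior described above): the
c          substitutions-file
c              \`colour\%
c              color\%
c          is meant to change colour into color, Colour into Color,
c          and COLOUR into COLOR, because each letter of the
c          translation takes the case of the correspondingly
c          numbered letter of the original. The \% at the end of
c          each line only guards against invisible trailing blanks.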
c
c     commands describing special allowed character specifications:
c       \' is a single or double quote
c       \_ is a space or tab
c       \+ is a space, tab, carriage return, line feed, BOF, or EOF
c       \5 is any digit
c       \6 is a plus or minus sign
c       \a is any character, but not begin or end of file
c       \l is any lower case letter
c       \u is any upper case letter (note lower case u)
c
c     all following characters up to, but not including, a closing
c     character:
c       \: all characters following the last selected character
c          until the character that closes that character. For
c          example, (\: will first select a ( opening parenthesis,
c          and then all characters until, but not including, the
c          closing ) parenthesis. Which brings up the question what
c          character closes what. The closers of
c          (, {, [, <, `, and & are ), }, ], >, ', and ;
c          respectively. (The latter two are for TeX quotes and
c          html characters respectively.) In addition, any
c          odd-numbered control character from \A to \Y has the
c          next even-numbered one as closer. (However, you are
c          unlikely to want to use the \I\J, \K\L, and \M\N pairs,
c          as these are used for file formatting.) Any other
c          character has itself as closer. So $ is closed by $.
c          And ) by ).
c       \. allows you to fully specify what character closes what.
c          Use it as \.MATCHING_PAIRS\. instead of \:. For example
c          (\.()\. is the same as (\: above. For \.\., any
c          unspecified character always closes itself. So for
c          (\.\. the closer is (, not ). If you specify more
c          matching pairs, they must be properly nested.
c          For example, from the text
c            ...the sky is blue, [Feynman [et al], (2006) p. 13], with...
c          [\.[]()\.] will select "[Feynman [et al], (2006) p. 13]"
c          but from
c            ...the sky is blue, [Feynman [et al], 2006) p. 13], with...
c          or:
c            ...the sky is blue, [Feynman [et (al], 2006) p. 13], with...
c          it will not select anything because the parentheses do
c          not match.
c
c     commands to define a group of allowed character specifications:
c       \& starts a group of match specifications (not using \&, \?,
c          \!, \~)
c       \* ends the group
c       aliases are \( and \)
c
c     command to invert allowed and nonallowed characters:
c       \~ inverts the following matching
c
c     command to allow zero or more occurrences to count as a match:
c       \? is all following matching characters (zero or more)
c       \! is up to 1 following matching character (zero or one)
c
c     Compound sequences must be ordered as one of
c       [case change][\?][\~]allowed_character_specification
c       [case change][\?][\~]\&...\*
c     Case changes inside ... are fine and are not limited to the
c     sequence.
c
c     For example, to match an integer number, use \?\+\?\6\5\?\5.
c     To match everything until the next letter, use \?\~\&\u\l\*,
c     or \-\?\~\u\=, or even \?\~\&\-\l\=\*.
c
c     To address the entire file, original \a\?\a is best. For
c     example, to change the file to upper case, use a substitution
c     file:
c       \a\?\a
c       \^
c     Note that \?\a will be found twice (first all characters will
c     be found, then zero characters at EOF will be found), so
c     \a\?\a is really needed.
c
c     Inside groups, any hyphen must be the first character of the
c     group. Otherwise the hyphen is taken to indicate a range. For
c     example, 0-5 inside a group is a digit up to 5, and
c     b-df-hj-np-tv-z is a consonant.
c
c     To avoid infinite loops, substitutions that replace no
c     characters in the original file are only done once. For
c     example,
c       \=
c       \J
c     puts a single linefeed before every character in the original
c     file, including before EOF, not infinitely many of them.
c     Similarly,
c       \-\?\l
c       -
c     replaces every contiguous sequence of letters with a single
c     hyphen, but also puts exactly one hyphen before each nonletter
c     character, including one before EOF.
c
c   Use memory value:
c
c     \m takes the original to be the current memory value as set by
c        \c below. If the memory value is defined but nul, the
c        original is taken as \~\&\a\[\]\*, or in words, not
c        anything.
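c
c   As an illustration of the character specifications and counts
c   described for the original (a sketch assuming Unix line ends,
c   not a tested example), the substitutions-file
c       \_\?\_\J\%
c       \J\%
c   is meant to delete spaces and tabs at the end of every line: \_
c   matches one space or tab, \?\_ all further ones, and \J the
c   linefeed, which the translation puts back. Blanks at the very
c   end of a file without a final linefeed would not be matched by
c   this original.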
c
c   Substring selection:
c
c     \{ marks the start of a substring of the original
c     \} marks the end of a substring of the original
c
c     These substrings can be used as part of the translation using
c     \# below.
c
c   Restart substitutions-file:
c
c     \r Restart substitutions-file from the beginning.
c        You may need to run subbin with two or more
c        substitutions-files if some substitutions should not be
c        repeated.
c        Note that \r should be at the end of the substitutions-file.
c        Anything beyond \r will be ignored.
c        There is no translation string following this original.
c     \x Exit if there were no nontrivial substitutions since the
c        previous time that we were here. Note that it is in
c        principle possible that changes undo other changes. This
c        might produce an infinite loop for some substitutions-files.
c        Normally \x precedes \r.
c        There is no translation string following this original.
c
c   Undo selection of a subpart:
c
c     \d# unselects # levels of selection done by \s below, or
c         unselects all if # is omitted.
c         There is no translation string following this original.
c
c   Set an external file name:
c
c     \f# enters a name for a subbin data file.
c         What happens depends on the number #, as follows:
c         1: A byte-to-unicode mapping is read in from the file,
c            replacing any existing one. See the \| translation.
c         2: A number to string mapping is read in from the file,
c            replacing any existing one. See the \= translation.
c         3: Sets the name of the file of saved strings used by the
c            \i or \o commands. Use before the first such command to
c            use the file. If a file of saved strings is already
c            open, it is closed first, and the \o label number is
c            *reset to restart from 321000001*.
c         4: Hyphenation data are read in from the file, replacing
c            any existing ones.
c         The actual file name should be on the next line. Do not
c         forget to double backslashes, especially when using MS DOS
c         or Windows.
c
c COMMANDS FOR THE TRANSLATION ONLY:
c
c   Insert memory value:
c
c     \m insert the current memory value as set by \c below.
c
c   Insert line number:
c
c     \n insert the line number of the first character of the found
c        text (as it was before the active substitutions)
c
c   Use substring:
c
c     \#N uses substring number N, where N is a digit from 1 to 9.
c     \#0 uses the entire original.
c
c   Conversion of the original or its substrings:
c
c     Note that only one conversion type is allowed at a time and
c     that the conversion type will reset to none after a conversion.
c     Further, if any one of the below conversion commands appears at
c     the end of a line, \#0 will be appended.
c
c     Case conversions:
c
c     \l converts the original to lower case (as ASCII).
c     \u converts the original to upper case (as ASCII).
c     \$ converts the first letter in the original to upper case (as
c        ASCII).
c
c     Conversion between 7 bit and 8 bit characters:
c
c     \4 converts any 7 bit character in the original to the
c        corresponding 8 bit one and vice-versa. Typically used to
c        prevent substitutions for certain strings. To check for
c        pre-existing 8 bit characters use translation \! for
c        original \(\3\ -\3\2\).
c
c     Hyphenation:
c
c     \h hyphenates the selected word. See hyphen.f for more.
c
c     Conversions between character encoding formats:
c
c     Note: A character in "internal" format takes the form
c     \U\l\l...\l\V where \l means a lower case letter, \U means
c     Ctrl-U, and \V Ctrl-V. This format is convenient since you can
c     then search for words using \-\(\U\l\?\l\V\)\?\(\U\l\?\l\V\).
c     To convert to another format than internal, first convert to
c     internal and then from internal to the other format.
c
c     \& converts between &...; html format and internal,
c        bidirectionally. Normally, characters in html format are
c        written by name, if there is one. However, if you double
c        the \&, the html character is written by number.
c        If you triple the \&, the html character is written by name
c        if there is one and the resulting string is no longer than
c        the numeric one. Numbers are normally written in decimal
c        form. However, the \x command changes that to hexadecimal.
c     \x sets hexadecimal mode for writing numbers. This remains
c        active for a given original until turned off by a second \x.
c     \' converts between UTF-8 format and internal, bidirectionally.
c        If the string starts with Ctrl-U, ends with Ctrl-V, and
c        has only lower case letters inside, one or more, conversion
c        from internal format to UTF-8 will be attempted. (Such a
c        string matches the original "\U\l\?\l\V" without the
c        quotes.) In *all* other cases, conversion of the string
c        from UTF-8 to internal format will be attempted (leaving all
c        ASCII unchanged).
c     \" like \', but all selected ASCII is converted to internal too.
c     \` converts between RTF unicode and internal. There must be
c        exactly 1 additional nondigit character that is used to
c        terminate each \u[number] (sub)string. This would typically
c        be the ? that is printed by non-unicode aware programs (in
c        \ucs1 format). In fact, the conversion from internal to RTF
c        always puts in a ?. Note that if someone gives you a .rtf
c        file with unknown or variable ucs values, it will presumably
c        be impossible to convert the unicode characters to other
c        formats in a generic way. However, if you know, say, that
c        the file always uses ucs 0 or 1, it is much easier. And if
c        you create the RTF yourself, there is obviously no
c        difficulty. For Unicode numbers over 65535, each of the two
c        \u[number] strings has a ?, or whatever. (That does not
c        make much sense to me, but that is what the MS RTF 1.9.1
c        whitepaper shows.) Because of the possibility of a second
c        string, \` must be alone. So the original/translation pair
c        for converting to internal will be
c          \\u\!-\(0-9\)\?\(0-9\)\a\%  (or maybe ? instead of \a)
c          \`
c        and the algorithm will automatically also gobble up the
c        second string if there is one.
c        See \' above for how the direction of conversion is decided.
c     \| converts between an 8 bit character format (like ISO-8859)
c        and internal, bidirectionally. If this is used, the working
c        directory must contain a file 8bit_to_uc.sub. (Or you must
c        have read in an equivalent file with the \f1 command.) Each
c        line in the file must list an 8 bit character number (from 0
c        to 255), and behind it the corresponding unicode value (from
c        0 to 1114111). Byte values not listed in 8bit_to_uc.sub are
c        converted to the unicode character of the same number.
c        If you use an \f1 command followed by filename nil, no file
c        is read and an identity mapping is generated. This can be
c        used to convert every byte in a file to the corresponding
c        internal number. From there it could be converted to, say,
c        html form.
c        See \' above for how the direction of conversion is decided.
c     \= converts between strings and internal. The number to string
c        mapping must be given in a file num_str-map.sub. (Or you
c        must have read in an equivalent file with the \f2 command.)
c        This file may have data lines like, say:
c          163 \pounds # POUND SIGN
c          -5 8364 \euro # EURO SIGN
c        where 163 and 8364 are the numbers, \pounds and \euro the
c        strings, and -5 gives the length of the \euro string. The
c        numbers, 163 and 8364 above, must be in nondecreasing order.
c        See the source code of numstr (in the hyphen_*.f files) for
c        more details on the allowed number/string mapping files.
c        In the conversion from strings to numbers, the found string
c        must normally match the string in the data file exactly.
c        However, if the \= is repeated 1 or 3 times, whitespace and
c        control characters in the found string are ignored in the
c        comparison. If the \= is repeated 2 or 3 times, the found
c        string is converted to lower case in doing the comparison.
c        If you want whitespace or case of the data file string to be
c        ignored, you will need to put every variation of the string
c        in the data file, making the number nonunique. In
c        conversion from number to string, the first variation will
c        then be used.
c        Note that \= is redundant: everything it does can be done
c        with normal subbin original/substitution lines. However, \=
c        may be more convenient, and/or much faster, for large
c        amounts of numbered strings. (Every original/substitution
c        line requires the text to be read through from scratch.
c        However, one reading through using \= could convert many
c        different strings.) Also note that strings can be converted
c        into other strings by running subbin.exe twice or changing
c        the mapping halfway using the \f2 command.
c     \6 converts between a UTF-16, little-endian encoded string and
c        internal, bidirectionally. Normally, this should be used to
c        convert an entire file. See \' above for how the direction
c        of conversion is decided. In the conversion to internal,
c        multiple characters can be converted (typically with
c        original \a\?\a).
c     \5 is like \6, except big endian format is used.
c
c     Conversions between integer number formats:
c
c     For these, the selected string must be a valid positive number
c     with no additional spaces. You can convert between any of
c     these formats by using decimal as an intermediate stage. Note
c     that numbers must be less than the machine limit (2147483647
c     usually).
c
c     \} converts from hexadecimal to decimal
c     \{ converts from decimal to hexadecimal
c     \) converts from octal to decimal
c     \( converts from decimal to octal
c     \> converts from binary to decimal
c     \< converts from decimal to binary
c
c     Forgiveness of errors:
c
c     \+ tries to fix up errors without aborting where it seems
c        reasonable. In particular the failed conversion is
c        abandoned beyond the failure point and the rest of the
c        string is kept as is. Use with caution. Repeat for more
c        forgiveness.
c     \- Reduces forgiveness again.
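c
c     As a hedged sketch of the integer number conversions listed
c     under the translation commands (untested, and assuming that
c     every digit string in the text is a valid positive number
c     below the machine limit), the substitutions-file
c         \5\?\5\%
c         \{\#0\%
c     is meant to rewrite every decimal digit string in hexadecimal.
c     The \#0 is written out explicitly because the \{ is followed
c     by the comment command \% instead of ending the line.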
c
c   Encoding and decoding:
c
c     \esss...ssc encodes any character string sss...ss in the
c                 original as the single character c, usually for
c                 further manipulation like with \r below. If a
c                 character like ASCII nul does not appear in the
c                 file, it makes a good choice for c.
c     \rcsss...ss replaces every character c in the original by
c                 character string sss...ss. This string may be
c                 empty to delete c.
c
c     These commands must appear alone.
c
c   Saving and restoring originals:
c
c     \o outputs the expanded original to a file and puts a numeric
c        label for it in the translation. Labels are consecutive 9
c        digit numbers starting with 321: 321000001, 321000002,
c        321000003, ...
c     \i causes an original to be inserted again. In particular,
c        at the following \# [sub]string, which must be a valid
c        label, the saved original with that label is inserted.
c
c     For example, the substitutions file
c       \\end{equation}
c       \A
c       \\begin{equation}\?\~\A\A
c       equation \o
c     replaces every LaTeX equation environment by a string
c       equation 321......
c     This avoids problems with spell and grammar checkers and such.
c     Afterwards, a substitutions file
c       equation \{321\(0-9\)\(0-9\)\(0-9\)\(0-9\)\(0-9\)\(0-9\)\}
c       \i\#1
c     will restore the equation environments again.
c
c     Note that if 8 bit swap is active during the output or input,
c     it is applied to the original saved in subbinss.sub.
c
c     The file name for the saved strings can be set with the \f3
c     command before the first \i or \o command. The default is
c     subbinss.sub.
c
c     Warnings:
c     1) Any existing file with that name will be overwritten.
c     2) You cannot mix \o and \i commands for an open file with
c        saved strings. To switch between \o and \i commands, you
c        need to reopen the file using \f3 or run subbin.exe twice.
c     3) Of course, if a file of saved strings gets overwritten,
c        corrupted or deleted, the corresponding originals can no
c        longer be restored.
c     4) The file of saved strings will be in DOS CR-LF format.
c        That is to make it more readable under MS Windows.
c     5) Each \o translation should have a unique format. For
c        example, if you also convert displaymath environments in
c        addition to the above, use "display \o" as translation, not
c        again "equation \o". It is recommended that the word
c        immediately preceding the \o is unique. Otherwise,
c        restoration errors may be possible and/or the restoring of
c        environments can become hopelessly slow.
c     6) If there are already 9 digit numbers starting with 321 in
c        the original text file, they might cause problems. Check
c        for that using \!. (And in the example above, also check
c        for the \A used to encode the \end{equation} strings.)
c
c   Translation specials for a specific occurrence:
c
c     In the following special translation strings, # stands for a
c     number specifying a particular occurrence of the original
c     string. If the number is omitted, the last occurrence is used.
c     If there are fewer occurrences of the original string than the
c     number #, a nul string at the end of file is selected instead.
c     Number # may not contain spaces, including no leading or
c     trailing spaces.
c
c     \b# does not modify the originals, but selects the #-th
c         instance. Further manipulations will be restricted to the
c         selected string, until the string is deselected using
c         original \d. If at the time of deselection the selection
c         is not nil, execution will loop back to the \b statement,
c         selecting the next instance for processing. Note that
c         this may produce an infinite loop. Make sure that
c         eventually the selection must be nil at the point of
c         deselection.
c     \c# does not modify the originals, but copies the #-th instance
c         to the memory location.
c     \f# simply finds the #-th occurrence. (Not useful, included
c         for programming purposes.)
c     \s# is like \b, but does not loop. Only one instance is
c         processed. Do not forget to deselect with \d before
c         exiting.
c     \v# does not modify the originals, except converts the #-th
c         instance to lowercase (as ASCII).
c     \^# does not modify the originals, except converts the #-th
c         instance to uppercase (as ASCII).
c     \!# terminates the program with error code when the #-th
c         instance is found. Normally, # should be followed by : and
c         a description of the problem.
c     \?# Like \!#, but execution continues after a warning, unless
c         the user aborts by entering q.
c     \,# like \!#, but exits with zero error code.
c     \.# like \,#, but exits if the string is *not* found.
c
c     These commands must appear alone. If the original contains a
c     substring, that substring is selected instead of the entire
c     string.
c
c List of special characters used in original and translation:
c _!"#$%&'()*+,-./0123456789:;<=>?@A-Z[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
c oo ooooooo oo oooo oooooo o o oooooo ooo o o oo o o o o oo
c tttttttttt tttt tttttttttt tttt ttt t t t tt tt tt tt t tt tt t ttt
c
c declarations
c
c     avoid typos
      implicit none
c
c     common debug data
c
c     verbosity
      integer verbos
c     substitutions file
      character*132 subfil
      integer subfll
c     line count of the substitutions file
      integer lincts
c     text file
      character*132 txtfil
      integer txtfll
c     line count of the text file
      integer linctt
c     debug block
      common/cmdbg/verbos,subfil,subfll,lincts,txtfil,txtfll,linctt
c
c     local
c
c     output file
      character*132 outfil
      integer outfll
c     command line arguments
      integer argmx
      character*132 argt(4)
      integer argtl(4)
      external args
c     check on largest integer
      integer check
c     whether the unit is connected
      logical isopen
c
c executable
c
c     generic format
   10 format(a)
c
c     note that we are in the main program
      call insub('main')
c
c     initialize the common block
      verbos=0
      subfll=0
      txtfll=0
c
c     check that integers can be as big as the program assumes
      check=2147483647
      if(check.le.0)
     &     call fatal('Program error: unsupported integer format')
c
c     get the command line arguments
      call args(4,
     & 'Usage: subbin substitutions_file in_file out_file [-{q|n|v|V}]',
     & argmx,argt,argtl)
c
c     process them
      if(argmx.eq.1)then
         if(argt(1)(1:argtl(1)).eq.'-v')then
            print10,'subbin version 2.0'
            call exit(0)
         endif
         if(argt(1)(1:argtl(1)).eq.'-h' .or.
     &        argt(1)(1:argtl(1)).eq.'?' .or.
     &        argt(1)(1:argtl(1)).eq.'--help')then
            print10,
     &           'See the comments in source subbin.f for information.'
            call exit(0)
         endif
      endif
      if(argmx.lt.3 .or. argmx.gt.4)goto 9990
      if(argtl(1).gt.132)call fatal('Substitutions filename too long')
      subfll=argtl(1)
      subfil=argt(1)
      lincts=0
      if(argtl(2).gt.132)call fatal('Input filename too long')
      txtfll=argtl(2)
      txtfil=argt(2)
      linctt=0
      if(argtl(3).gt.132)call fatal('Output filename too long')
      outfll=argtl(3)
      outfil=argt(3)
      if(argmx.eq.4)then
         if(argt(4)(1:argtl(4)).eq.'-n')verbos=-9
         if(argt(4)(1:argtl(4)).eq.'-q')verbos=-1
         if(argt(4)(1:argtl(4)).eq.'-v')verbos=1
         if(argt(4)(1:argtl(4)).eq.'-V')verbos=2
         if(verbos.eq.0)goto 9990
         if(verbos.eq.-9)verbos=0
      endif
c
c     subbn0 does the actual work
      call subbn0(outfil(1:outfll))
c
c     all done
      inquire(4,opened=isopen)
      if(isopen)call closu(4,'the saved-strings data file')
      call exit(0)
c
c     error exit
 9990 call fatal(
     & 'Usage: subbin substitutions_file in_file out_file [-{q|n|v|V}]')
      end