Results 1 to 8 of 8

Thread: Howto parse .txt file? Need to get in format for database load

  1. #1
    Senior Member Joe-MN's Avatar
    Join Date
    Sep 2009
    Location
    Twin Cities, MN , USA
    Posts
    331

    Howto parse .txt file? Need to get in format for database load

    Hello,

    I have a .txt file with lots of data I have saved over time, what is the best (free) software that will allow me to get it into a .csv file for inserting into a database.

    The 'records' in the text file are all separated by a string of identifiable characters, so I will know when a new one starts. First part of the 'record' will have a bunch of short fields delimited by '/'s, but after the nth '/', it would be a long stretch of description until the next 'record' starts.

    Will python work? How about Windows PowerShell.

    Input from text file, output to csv file? Is the csv format the best of output? Will be using MS-Access for database

    Thanks
    Better to remain silent and be thought a fool, than to speak and remove all doubt.

  2. #2
    Python will do it (google something like "python parse strings" or "python split strings"). Perl or Ruby would be other good quick and dirty options.

    CSV or tab delimited files will probably be fine for a one time import. If you're planning to do this a bunch, I'd probably format the output as JSON matching the database schema.

    Edited to add: if you use CSV, and any of your parsed strings have commas in them, be sure to enclose them in quotes:

    "this string, it contains a comma"
    Last edited by Katja; 06-24-2015 at 11:11 PM.

  3. #3
    cat foo.txt | sed 's/\//,/g' > foo.csv

    That is on linux. If it isn't full of private info just message me and I can convert it in 2 seconds. If it is download sed from cygwin

    Quote Originally Posted by Joe-MN View Post
    Hello,

    I have a .txt file with lots of data I have saved over time, what is the best (free) software that will allow me to get it into a .csv file for inserting into a database.

    The 'records' in the text file are all separated by a string of identifiable characters, so I will know when a new one starts. First part of the 'record' will have a bunch of short fields delimited by '/'s, but after the nth '/', it would be a long stretch of description until the next 'record' starts.

    Will python work? How about Windows PowerShell.

    Input from text file, output to csv file? Is the csv format the best of output? Will be using MS-Access for database

    Thanks

  4. #4
    Senior Member Joe-MN's Avatar
    Join Date
    Sep 2009
    Location
    Twin Cities, MN , USA
    Posts
    331
    Thanks for the offer. Data is not really private, but it is dirty (as in not all formatted the same, NOT XXX!!!). I will try to do it myself first. Can you point me to someplace with the syntax of the Linux command that you are suggesting to use? I have used Ubuntu just a little. Thanks

    Quote Originally Posted by t8burst View Post
    cat foo.txt | sed 's/\//,/g' > foo.csv

    That is on linux. If it isn't full of private info just message me and I can convert it in 2 seconds. If it is download sed from cygwin
    Better to remain silent and be thought a fool, than to speak and remove all doubt.

  5. #5
    Senior Member zagam's Avatar
    Join Date
    Jan 2007
    Location
    Western Australia - Hammer wielding daemon
    Posts
    619
    I use AWK others use Perl. See the AWK book.

  6. #6
    Quote Originally Posted by Joe-MN View Post
    Thanks for the offer. Data is not really private, but it is dirty (as in not all formatted the same, NOT XXX!!!). I will try to do it myself first. Can you point me to someplace with the syntax of the Linux command that you are suggesting to use? I have used Ubuntu just a little. Thanks
    http://unixhelp.ed.ac.uk/CGI/man-cgi?sed

    There are 50 different ways to do what you want. But if all you need to do is replace "/" with "," then sed is easy. If you have a ton of files I would probably use perl but for a couple sed is fine. sed is just a way to pipe regular expressions through a file. If you want to get fancy google "regular expressions" and embrace the cryptic world of regex
    Last edited by t8burst; 06-26-2015 at 09:56 AM.

  7. #7
    Senior Member zagam's Avatar
    Join Date
    Jan 2007
    Location
    Western Australia - Hammer wielding daemon
    Posts
    619

    CSV is harmful

    The problem with CSV is that it uses commas, which are normal printable characters. Then it hides them with quotes that are also normal printable characters. Quotes and escapes are harmful and are an attack vector. I have actually exploited the C <string.h> and <stdio.h> '\0' with real programs.

    In Unix you have tab as a field separator FS and line feed as a record separator RS. If you exclude these non printing characters from your data they will always be valid FS or RS. My C program below replaces any FS or RS with a space when it converts CSV to TSV.

    With Unix data files do not need to load data before you can use it as is the case with XML, S-Expressions, CSV, etc.

    If sorted with sort(1) you can query with the look(1) command. Unix is a 4GL in that it operates on sets of data rather than elements. It also has file security to avoid severe data loss. Read the man pages.

    Code:
    LC_COLLATE=C
    export LC_COLLATE
    # Exact
    look -b "$KEY^I" file
    # Partial
    look -b "$KEY" file. 
    # Use join(1) and cut(1)
    look -b "$KEY^I" file1 | join -t'^I' - file2 | cut -d'^I' -f1-3
    # Use comm() for other set operations
    # use sort() to sort and or merge data
    sort -m -t'^I' -k1 file1 file2 >file3
    # This can be done in a lock with umask(1)
    # Read only lock file created open for writing
    # Write new data in lock file
    # Make readable and then move in to place
    umask 0226
    {
    sort -m -u -t'^I' -k1 file1 file2
    chmod --reference=file3 file3.lock
    mv file3.lock file3
    } >file3.lock
    Code:
    #!/usr/bin/tcc -run
    /*BINFMTC: -O -Wuninitialized -Werror -Wfatal-errors -pedantic-errors
    csv2txt.c - A Microsoft CSV reader - Andrew Buckeridge
    1997,Ford,E350,"Super, ""luxurious"" truck"^M
    1997,Ford,E350,"Go get one now^M
    they are going fast"^M
    http://en.wikipedia.org/wiki/Comma-separated_values
    http://tools.ietf.org/html/rfc4180
    http://www.iana.org/assignments/medi...parated-values
    */
    #include <stdlib.h>
    #include <stdio.h>
    
    int main()
    {
            int c, o, inquot;
    
            for (inquot=0,o=' ',c=getchar_unlocked();c!=EOF;o=c,c=getchar_unlocked()) {
                            switch (c) {
                            case '\t':
                                    c=' ';
                                    break;
                            case '\"':
                                    inquot=!inquot;
                                    if (o=='\"') { /* This one an escape */
                                            putchar_unlocked(c);
                                            c=' '; /* Next one a quote */
                                    }
                                    continue;
                            case ',':
                                    if (!inquot)
                                            c='\t';
                                    break;
                            case '\r':
                                    continue;
                            case '\n':
                                    if (inquot)
                                            c=' ';
                            }
                            putchar_unlocked(c);
            }
    
            exit(0); /* fflush() <stdio.h> */
    
            return 0; /* Not reached, but main() returns this */
    }
    Last edited by zagam; 06-30-2015 at 05:45 AM. Reason: data loss ;(

  8. #8
    You crack me up. Do even read the original posts when you respond? The guy had a text file where the fields were delineated by "/" and he wanted to change it to "," and then presumably load it into excel (OP: come to think of it, if you are using Excel, go to the "data" tab, click on "import text file" then when following the wizard make "/" the column separator, you don't even need to preprocess the file).

    How in any way is that "dangerous"? In fact there is nothing dangerous about csv files anyway. Any programmer with half a brain can write a CSV parser that can't be used to do any sort of buffer injection or virus loading.

    Quote Originally Posted by zagam View Post
    The problem with CSV is that it uses commas, which are normal printable characters. Then it hides them with quotes that are also normal printable characters. Quotes and escapes are harmful and are an attack vector. I have actually exploited the C <string.h> and <stdio.h> '\0' with real programs.

    In Unix you have tab as a field separator FS and line feed as a record separator RS. If you exclude these non printing characters from your data they will always be valid FS or RS. My C program below replaces any FS or RS with a space when it converts CSV to TSV.

    With Unix data files do not need to load data before you can use it as is the case with XML, S-Expressions, CSV, etc.

    If sorted with sort(1) you can query with the look(1) command. Unix is a 4GL in that it operates on sets of data rather than elements. It also has file security to avoid severe data loss. Read the man pages.

    Code:
    LC_COLLATE=C
    export LC_COLLATE
    # Exact
    look -b "$KEY^I" file
    # Partial
    look -b "$KEY" file. 
    # Use join(1) and cut(1)
    look -b "$KEY^I" file1 | join -t'^I' - file2 | cut -d'^I' -f1-3
    # Use comm() for other set operations
    # use sort() to sort and or merge data
    sort -m -t'^I' -k1 file1 file2 >file3
    # This can be done in a lock with umask(1)
    # Read only lock file created open for writing
    # Write new data in lock file
    # Make readable and then move in to place
    umask 0226
    {
    sort -m -u -t'^I' -k1 file1 file2
    chmod --reference=file3 file3.lock
    mv file3.lock file3
    } >file3.lock
    Code:
    #!/usr/bin/tcc -run
    /*BINFMTC: -O -Wuninitialized -Werror -Wfatal-errors -pedantic-errors
    csv2txt.c - A Microsoft CSV reader - Andrew Buckeridge
    1997,Ford,E350,"Super, ""luxurious"" truck"^M
    1997,Ford,E350,"Go get one now^M
    they are going fast"^M
    http://en.wikipedia.org/wiki/Comma-separated_values
    http://tools.ietf.org/html/rfc4180
    http://www.iana.org/assignments/medi...parated-values
    */
    #include <stdlib.h>
    #include <stdio.h>
    
    int main()
    {
            int c, o, inquot;
    
            for (inquot=0,o=' ',c=getchar_unlocked();c!=EOF;o=c,c=getchar_unlocked()) {
                            switch (c) {
                            case '\t':
                                    c=' ';
                                    break;
                            case '\"':
                                    inquot=!inquot;
                                    if (o=='\"') { /* This one an escape */
                                            putchar_unlocked(c);
                                            c=' '; /* Next one a quote */
                                    }
                                    continue;
                            case ',':
                                    if (!inquot)
                                            c='\t';
                                    break;
                            case '\r':
                                    continue;
                            case '\n':
                                    if (inquot)
                                            c=' ';
                            }
                            putchar_unlocked(c);
            }
    
            exit(0); /* fflush() <stdio.h> */
    
            return 0; /* Not reached, but main() returns this */
    }

Similar Threads

  1. MP4 Format
    By mike in forum Computers
    Replies: 3
    Last Post: 12-12-2008, 05:05 AM
  2. Will Someone help me format a website?
    By JenJen in forum Computers
    Replies: 0
    Last Post: 11-18-2007, 12:18 PM
  3. Howto Powerchair
    By Chess in forum Equipment
    Replies: 4
    Last Post: 07-03-2006, 04:31 PM
  4. JPG format
    By Dominique in forum Computers
    Replies: 1
    Last Post: 08-28-2005, 10:39 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •