Downloading Wikiquote Quotations  
Getting things done can be all about inspiration?

(the classic logo)


When it comes to data collection efforts, the spark-for-today arrived with the realization that far too many of the citations collected by Wikiquote are absolute garbage.

It seems that "Quotations are only as good as the people who share them." (... and you can quote me on that one) (lol.)

UN-censorship


To aid in the trash-identification effort, one must also understand that one man's weeds are another man's salad. For that reason alone, moving forward the intent is that quotations be classified - never deleted.

Why keep the bad ones? -Because freedom of speech means that both sage and fool must learn to live in their own droppings. (ahem - that quote is nothing compared to the horrific experience many will have reviewing those Wikiquotes!)

Strategy


Wikiquote presently has over 150,000 wanna-be pop-sayings to triage here, friends... yet wee Piranhas consume impossibly-sized meals one byte at a time ...

So yes, perhaps wee 'Quoties be few & far between... but the inspiration du jour is to allow our handful to exchange NEW quotes - as well as classification meta-data for the others - on such an impossibly-sized collection of willfully corrupted, would-be inspiration.

File Format


Today's shared data file is WikiQuote_Data.zip.

From the documentation therein:

The name of the file indicates the date that the data were collected.

File Format
===========
(1) Each line contains a single citation ("quote.")

(2) Columns in each line are TAB separated.

(3) There are at least 3 columns per line:

Field #1: Overall Classification
--------- Default is "Unknown"
Doctor Quote Uses: My_Favorite(5), Very_Good(4), Good(3), Not_Good(2), Deleted(1), Unknown(0)
Field #2: Citation Hash Code
--------- Used to uniquely identify each citation.
The default is a Winzip-style, zero-seeded CRC32.
Field #3: Citation
--------- The quote, as harvested from Wikiquote.org
Note that HTML encodings have replaced embedded newlines ("<br>") and single quotes ("&#39;")
Field 4+: Page Reference(s)
--------- The location(s) where each quotation was found on Wikiquote.org
Technically unlimited, yet typically containing only one page-reference.

...
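For anyone who wants to script against the archive, here is a minimal sketch of a reader for the documented layout. The function name and the decoding steps are my own, based solely on the field descriptions above - not part of the Doctor Quote code:

```python
import html

def read_quotes(path):
    """Yield one dict per citation line in the documented TAB layout:
    classification, CRC32 hash, citation, then page reference(s)."""
    with open(path, encoding="utf8") as fh:
        for line in fh:
            row = line.rstrip("\n").split("\t")
            if len(row) < 3:
                continue  # skip malformed lines
            # Undo the documented substitutions: "<br>" for embedded
            # newlines, "&#39;" for single-quote characters.
            quote = html.unescape(row[2].replace("<br>", "\n"))
            yield {
                "classification": row[0],  # e.g. "Unknown"
                "hash": row[1],
                "quote": quote,
                "pages": row[3:],  # typically a single page reference
            }
```

A plain `split("\t")` is used rather than the csv module so that stray double-quote characters inside citations cannot confuse a csv quoting state machine.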


Software Assistance


To aid us all in our quote-collection quests, enthusiastic 'Quoties might be enticed by our new quote-collection software effort.

Designed to import, export, collect, classify, and share updates between several 'webless 'Quoties via the aforementioned WikiQuote_Data.zip file, our plan is to share a completely new "Doctor Quote" on GitHub as our time & resources allow.

Ode To RAD


Rather than the C/C++ used previously, the new version of Doctor Quote is being written in Java. --Mostly a proof of concept, note that future efforts will use C++17, as well as an appropriate GUI toolkit.

For the sake of completeness, we should probably also note that these data were initially collected using Python 3.

p.s.


IF you have read this far, then you might have what it takes to be one of the FEW ... the INSPIRED ... the 'Quoties?

If you think that you can endure the tedium, the deliberately offensive quotes, and the brainstorming involved while helping a group of 'litheads with such a clean-up effort, then feel free to CLICK HERE to group-up with us!


[ add comment ] ( 45 views )   |  permalink  |  related link
An SQLite Database For Wikiquotes 
Once we have any significant collection of data, the next natural thing to do is obvious ...

So here is a link to it!

Please note that:

(1) The database file is for sqlite3. It is ~50MB.
(2) Of the original 5,225, there are 4,544 populated topics ("pages").
(3) Of the original 153,621, there are 146,545 size-filtered citations ("quotes"). I estimate that as much as 80% of them are completely fatuous?
(4) Newlines embedded in quotes are encoded as "<br>".
(5) Single quotes are encoded as "&#39;".
(6) To eliminate duplications, the quotes.ID is a 'Pythonic:
zlib.crc32(bytes(quote, "utf8"), 0)
(7) The `pages` table relates to the `quotes` table via pages.quote_id = quotes.ID.
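The deduplication key above can be wrapped in a small helper; a minimal sketch (the function name is mine, not from the Doctor Quote code):

```python
import zlib

def quote_id(quote):
    """The deduplication key described above: a zero-seeded CRC32 over
    the quote's UTF-8 bytes (an unsigned 32-bit value in Python 3)."""
    return zlib.crc32(bytes(quote, "utf8"), 0)
```

Identical quote text always collapses to the same ID, which is how duplicates were eliminated. Worth noting: CRC32 is only 32 bits, so across ~150,000 strings a handful of collisions between distinct quotes is statistically possible.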

Also:
sqlite> .schema
CREATE TABLE pages (ID integer primary key not null, page text, quote_id integer);
CREATE TABLE quotes (ID integer primary key not null, quote text);
sqlite> select count(*) from pages;
153546
sqlite> select count(*) from quotes;
146545
sqlite> select count(*) from (select distinct page from pages);
4544
sqlite>
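To illustrate the pages-to-quotes relationship, here is a small sketch against an in-memory database using the exact schema above; the sample topic and quote are hypothetical:

```python
import sqlite3
import zlib

# Recreate the shared schema in memory and join pages to quotes.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pages (ID integer primary key not null, page text, quote_id integer);
CREATE TABLE quotes (ID integer primary key not null, quote text);
""")

# Compute the quote's ID using the documented zero-seeded CRC32 scheme.
qid = zlib.crc32(bytes("Sharing is caring.", "utf8"), 0)
con.execute("INSERT INTO quotes VALUES (?, ?)", (qid, "Sharing is caring."))
con.execute("INSERT INTO pages VALUES (?, ?, ?)", (1, "Sharing", qid))

# The join key: pages.quote_id references quotes.ID.
rows = con.execute(
    "SELECT q.quote FROM pages p JOIN quotes q ON q.ID = p.quote_id "
    "WHERE p.page = ?", ("Sharing",)).fetchall()
print(rows)
```

Since one page row exists per (page, quote) pairing, `select count(*) from pages` counts quote occurrences, while `select distinct page from pages` recovers the topic list - matching the 153,546 vs. 4,544 figures in the transcript above.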

I share this database in the hopes that a genuine 'quotie will help the planet by selecting their favorite quotes from this locus ... and share them with others!

(... and you can bet your *blippy* that I will be doing the same!)


Sharing is caring,

--Randall




Collecting Wikiquote Data Using Python 
You've 'gotta love collecting quotes - not only might they teach us, but reviewing quotations revered by others helps us better understand what motivates today's majorities.

Like many others, I also love Python 3. Not only is Python 3 finally ready for prime time, but - from gainful employment to games - Python's community is simply the most amazing set of programming enthusiasts in our modern world. -If you want to do something, chances are extremely good that someone has a package that can help you do it a LOT quicker.

So it was with collecting Wikiquotes!

Quotes Matter


I have been collecting quotes since my college days. Indeed, from then to date I have amassed a collection of around 100,000.

When it came time to snoop around Wikiquote therefore, how could any 'quotie worthy of the moniker NOT try to collect 'em all, as well?

So as I sat down to "learn something" on this traditional occidental day of rest, I decided to give the wikiquotes package a try.

After pip'ing it down, here is what I came up with:

import wikiquotes

# Search seeds: one Wikiquote search per letter and digit.
alpha = "abcdefghijklmnopqrstuvwxyz1234567890"
major = 1   # running author counter
minor = 1   # running quote counter
with open("./wikiquote_2017_10_22.txt", "w") as results:
    for char in alpha:
        try:
            result = wikiquotes.search(char, "english")
            zlist = list(result)
            for author in zlist:
                print(char, major, author)
                quotes = wikiquotes.get_quotes(author, "english")
                for quote in quotes:
                    # Skip Wikiquote housekeeping entries.
                    if str(quote).find("proposed by") == 0:
                        continue
                    if str(quote).find("(UTC)") != -1:
                        continue
                    print("tbd", char, minor, major, author, quote,
                          sep='|', file=results)
                    minor += 1
                major += 1
        except Exception:
            print("error", char, minor, major, "error", "no quotes",
                  sep='|', file=results)

Using the above, we were able to download 17,068 things to review. The fact that we have an even set of 360 'authors' (10 per) clearly indicates that I did not get 'em all the first time 'round... but I eventually got the vast majority [5,225 topics? 153,621 quotes?] of them... (*)

Quality Comments


Overall, I should note that I was disappointed with the quality of the quotations. While there were some decent citations that I did not have, a lot of the jibes seem to be far too fatuous; desperate attempts to garner cheap publicity for far too many unmemorable nouns. More than a few pages have absolutely no quotations on them at all.

Yet - as mentioned previously - as we 'quoties seek to separate the gold from the gall, over time history has an annoying tendency to ensure that only the strong will survive.

Enjoy the journey!

--Randall

p.s. If you would like to get the results of today's diversion, we just uploaded them to the Mighty Maxims Project.

(*) In order to keep the server load reasonable for our Wikipedia friends, I will keep THAT bit of code on my own 'local ... still sorting thru them! :)

